Search Results - "Nanda, Neel"
1. How to use and interpret activation patching
Published 23-04-2024. “…Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the…”
Journal Article
2. Explorations of Self-Repair in Language Models
Published 23-02-2024. “…Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon where if components in large language…”
Journal Article
3. Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Published 27-09-2023. “…Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model…”
Journal Article
4. Transcoders Find Interpretable LLM Feature Circuits
Published 17-06-2024. “…A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities…”
Journal Article
5. Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Published 14-05-2024. “…Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in…”
Journal Article
6. Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs
Published 11-02-2024. “…How do transformer-based large language models (LLMs) store and retrieve knowledge? We focus on the most basic form of this task -- factual recall, where the…”
Journal Article
7. Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching
Published 28-11-2023. “…Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional…”
Journal Article
8. Emergent Linear Representations in World Models of Self-Supervised Sequence Models
Published 02-09-2023. “…How do sequence models represent their decision-making process? Prior work suggests that an Othello-playing neural network learned nonlinear models of the board…”
Journal Article
9. A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
Published 06-02-2023. “…Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks. In…”
Journal Article
10. AtP: An efficient and scalable method for localizing LLM behaviour to components
Published 01-03-2024. “…Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep…”
Journal Article
11. Training Dynamics of Contextual N-Grams in Language Models
Published 01-11-2023. “…Prior work has shown the existence of contextual neurons in language models, including a neuron that activates on German text. We show that this neuron exists…”
Journal Article
12. Linear Representations of Sentiment in Large Language Models
Published 23-10-2023. “…Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this…”
Journal Article
13. Copy Suppression: Comprehensively Understanding an Attention Head
Published 06-10-2023. “…We present a single attention head in GPT-2 Small that has one main role across the entire training distribution. If components in earlier layers predict a…”
Journal Article
14. An Empirical Investigation of Learning from Biased Toxicity Labels
Published 04-10-2021. “…Collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels. As…”
Journal Article
15. N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models
Published 22-04-2023. “…Understanding the function of individual neurons within language models is essential for mechanistic interpretability research. We propose $\textbf{Neuron to…”
Journal Article
16. Progress measures for grokking via mechanistic interpretability
Published 12-01-2023. “…Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the number of parameters, training data, or…”
Journal Article
17. Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
Published 19-07-2024. “…Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM)…”
Journal Article
18. Fully General Online Imitation Learning
Published 17-02-2021. “…In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we…”
Journal Article
19. Interpreting Attention Layer Outputs with Sparse Autoencoders
Published 25-06-2024. “…Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular…”
Journal Article
20. Confidence Regulation Neurons in Language Models
Published 23-06-2024. “…Despite their widespread use, the mechanisms by which large language models (LLMs) represent and regulate uncertainty in next-token predictions remain largely…”
Journal Article