Search Results - "Nanda, Neel"

  1. How to use and interpret activation patching by Heimersheim, Stefan, Nanda, Neel

    Published 23-04-2024
    “…Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the…”
    Journal Article
  2. Explorations of Self-Repair in Language Models by Rushing, Cody, Nanda, Neel

    Published 23-02-2024
    “…Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon where if components in large language…”
    Journal Article
  3. Towards Best Practices of Activation Patching in Language Models: Metrics and Methods by Zhang, Fred, Nanda, Neel

    Published 27-09-2023
    “…Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model…”
    Journal Article
  4. Transcoders Find Interpretable LLM Feature Circuits by Dunefsky, Jacob, Chlenski, Philippe, Nanda, Neel

    Published 17-06-2024
    “…A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities…”
    Journal Article
  5. Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control by Makelov, Aleksandar, Lange, George, Nanda, Neel

    Published 14-05-2024
    “…Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in…”
    Journal Article
  6. Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs by Chughtai, Bilal, Cooney, Alan, Nanda, Neel

    Published 11-02-2024
    “…How do transformer-based large language models (LLMs) store and retrieve knowledge? We focus on the most basic form of this task -- factual recall, where the…”
    Journal Article
  7. Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching by Makelov, Aleksandar, Lange, Georg, Nanda, Neel

    Published 28-11-2023
    “…Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional…”
    Journal Article
  8. Emergent Linear Representations in World Models of Self-Supervised Sequence Models by Nanda, Neel, Lee, Andrew, Wattenberg, Martin

    Published 02-09-2023
    “…How do sequence models represent their decision-making process? Prior work suggests that an Othello-playing neural network learned nonlinear models of the board…”
    Journal Article
  9. A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations by Chughtai, Bilal, Chan, Lawrence, Nanda, Neel

    Published 06-02-2023
    “…Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks. In…”
    Journal Article
  10. AtP: An efficient and scalable method for localizing LLM behaviour to components by Kramár, János, Lieberum, Tom, Shah, Rohin, Nanda, Neel

    Published 01-03-2024
    “…Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep…”
    Journal Article
  11. Training Dynamics of Contextual N-Grams in Language Models by Quirke, Lucia, Heindrich, Lovis, Gurnee, Wes, Nanda, Neel

    Published 01-11-2023
    “…Prior work has shown the existence of contextual neurons in language models, including a neuron that activates on German text. We show that this neuron exists…”
    Journal Article
  12. Linear Representations of Sentiment in Large Language Models by Tigges, Curt, Hollinsworth, Oskar John, Geiger, Atticus, Nanda, Neel

    Published 23-10-2023
    “…Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this…”
    Journal Article
  13. Copy Suppression: Comprehensively Understanding an Attention Head by McDougall, Callum, Conmy, Arthur, Rushing, Cody, McGrath, Thomas, Nanda, Neel

    Published 06-10-2023
    “…We present a single attention head in GPT-2 Small that has one main role across the entire training distribution. If components in earlier layers predict a…”
    Journal Article
  14. An Empirical Investigation of Learning from Biased Toxicity Labels by Nanda, Neel, Uesato, Jonathan, Gowal, Sven

    Published 04-10-2021
    “…Collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels. As…”
    Journal Article
  15. N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models by Foote, Alex, Nanda, Neel, Kran, Esben, Konstas, Ioannis, Barez, Fazl

    Published 22-04-2023
    “…Understanding the function of individual neurons within language models is essential for mechanistic interpretability research. We propose $\textbf{Neuron to…”
    Journal Article
  16. Progress measures for grokking via mechanistic interpretability by Nanda, Neel, Chan, Lawrence, Lieberum, Tom, Smith, Jess, Steinhardt, Jacob

    Published 12-01-2023
    “…Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or…”
    Journal Article
  17. Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders by Rajamanoharan, Senthooran, Lieberum, Tom, Sonnerat, Nicolas, Conmy, Arthur, Varma, Vikrant, Kramár, János, Nanda, Neel

    Published 19-07-2024
    “…Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM)…”
    Journal Article
  18. Fully General Online Imitation Learning by Cohen, Michael K, Hutter, Marcus, Nanda, Neel

    Published 17-02-2021
    “…In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we…”
    Journal Article
  19. Interpreting Attention Layer Outputs with Sparse Autoencoders by Kissane, Connor, Krzyzanowski, Robert, Bloom, Joseph Isaac, Conmy, Arthur, Nanda, Neel

    Published 25-06-2024
    “…Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular…”
    Journal Article
  20. Confidence Regulation Neurons in Language Models by Stolfo, Alessandro, Wu, Ben, Gurnee, Wes, Belinkov, Yonatan, Song, Xingyi, Sachan, Mrinmaya, Nanda, Neel

    Published 23-06-2024
    “…Despite their widespread use, the mechanisms by which large language models (LLMs) represent and regulate uncertainty in next-token predictions remain largely…”
    Journal Article