Search Results - "Hubinger, Evan"

  • Showing results 1 - 13 of 13
  1.

    Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant by Järviniemi, Olli, Hubinger, Evan

    Published 25-04-2024
    “…We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees…”
    Journal Article
  2.

    An overview of 11 proposals for building safe advanced AI by Hubinger, Evan

    Published 04-12-2020
    “…This paper analyzes and compares 11 different proposals for building safe advanced AI under the current machine learning paradigm, including major contenders…”
    Journal Article
  3.

    Engineering Monosemanticity in Toy Models by Jermyn, Adam S, Schiefer, Nicholas, Hubinger, Evan

    Published 16-11-2022
    “…In some neural networks, individual neurons correspond to natural ‘features’ in the input. Such monosemantic neurons are of great help in…”
    Journal Article
  4.

    Conditioning Predictive Models: Risks and Strategies by Hubinger, Evan, Jermyn, Adam, Treutlein, Johannes, Hudson, Rubi, Woolverton, Kate

    Published 01-02-2023
    “…Our intention is to provide a definitive reference on what it would take to safely make use of generative/predictive models in the absence of a solution to the…”
    Journal Article
  5.

    Steering Llama 2 via Contrastive Activation Addition by Panickssery, Nina, Gabrieli, Nick, Schulz, Julian, Tong, Meg, Hubinger, Evan, Turner, Alexander Matt

    Published 08-12-2023
    “…We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA…”
    Journal Article
  6.

    Sabotage Evaluations for Frontier Models by Benton, Joe, Wagner, Misha, Christiansen, Eric, Anil, Cem, Perez, Ethan, Srivastav, Jai, Durmus, Esin, Ganguli, Deep, Kravec, Shauna, Shlegeris, Buck, Kaplan, Jared, Karnofsky, Holden, Hubinger, Evan, Grosse, Roger, Bowman, Samuel R, Duvenaud, David

    Published 28-10-2024
    “…Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models…”
    Journal Article
  7.

    Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models by Denison, Carson, MacDiarmid, Monte, Barez, Fazl, Duvenaud, David, Kravec, Shauna, Marks, Samuel, Schiefer, Nicholas, Soklaski, Ryan, Tamkin, Alex, Kaplan, Jared, Shlegeris, Buck, Bowman, Samuel R, Perez, Ethan, Hubinger, Evan

    Published 14-06-2024
    “…In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals…”
    Journal Article
  8.

    Studying Large Language Model Generalization with Influence Functions by Grosse, Roger, Bae, Juhan, Anil, Cem, Elhage, Nelson, Tamkin, Alex, Tajdini, Amirhossein, Steiner, Benoit, Li, Dustin, Durmus, Esin, Perez, Ethan, Hubinger, Evan, Lukošiūtė, Kamilė, Nguyen, Karina, Joseph, Nicholas, McCandlish, Sam, Kaplan, Jared, Bowman, Samuel R

    Published 07-08-2023
    “…When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of…”
    Journal Article
  9.
  10.
  11.
  12.

    Risks from Learned Optimization in Advanced Machine Learning Systems by Hubinger, Evan, van Merwijk, Chris, Mikulik, Vladimir, Skalse, Joar, Garrabrant, Scott

    Published 05-06-2019
    “…We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer - a situation we refer to as…”
    Journal Article
  13.