Search Results - "Hubinger, Evan"
1. Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Published 25-04-2024
“…We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees…”
Journal Article
2. An overview of 11 proposals for building safe advanced AI
Published 04-12-2020
“…This paper analyzes and compares 11 different proposals for building safe advanced AI under the current machine learning paradigm, including major contenders…”
Journal Article
3. Engineering Monosemanticity in Toy Models
Published 16-11-2022
“…In some neural networks, individual neurons correspond to natural "features" in the input. Such monosemantic neurons are of great help in…”
Journal Article
4. Conditioning Predictive Models: Risks and Strategies
Published 01-02-2023
“…Our intention is to provide a definitive reference on what it would take to safely make use of generative/predictive models in the absence of a solution to the…”
Journal Article
5. Steering Llama 2 via Contrastive Activation Addition
Published 08-12-2023
“…We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA…”
Journal Article
6. Sabotage Evaluations for Frontier Models
Published 28-10-2024
“…Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models…”
Journal Article
7. Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Published 14-06-2024
“…In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals…”
Journal Article
8. Studying Large Language Model Generalization with Influence Functions
Published 07-08-2023
“…When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of…”
Journal Article
9. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Published 10-01-2024
“…Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue…”
Journal Article
10. Measuring Faithfulness in Chain-of-Thought Reasoning
Published 16-07-2023
“…Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear…”
Journal Article
11. Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
Published 16-07-2023
“…As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help…”
Journal Article
12. Risks from Learned Optimization in Advanced Machine Learning Systems
Published 05-06-2019
“…We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer - a situation we refer to as…”
Journal Article
13. Discovering Language Model Behaviors with Model-Written Evaluations
Published 19-12-2022
“…As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates…”
Journal Article