Search Results - "Hubinger, Evan"
1. Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Published 25-04-2024
“…We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees…”
Journal Article
2. An overview of 11 proposals for building safe advanced AI
Published 04-12-2020
“…This paper analyzes and compares 11 different proposals for building safe advanced AI under the current machine learning paradigm, including major contenders…”
Journal Article
3. Engineering Monosemanticity in Toy Models
Published 16-11-2022
“…In some neural networks, individual neurons correspond to natural "features" in the input. Such monosemantic neurons are of great help in…”
Journal Article
4. Conditioning Predictive Models: Risks and Strategies
Published 01-02-2023
“…Our intention is to provide a definitive reference on what it would take to safely make use of generative/predictive models in the absence of a solution to the…”
Journal Article
5. Steering Llama 2 via Contrastive Activation Addition
Published 08-12-2023
“…We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA…”
Journal Article
6. Sabotage Evaluations for Frontier Models
Published 28-10-2024
“…Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models…”
Journal Article
7. Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Published 14-06-2024
“…In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals…”
Journal Article
8. Studying Large Language Model Generalization with Influence Functions
Published 07-08-2023
“…When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of…”
Journal Article
9. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Published 10-01-2024
“…Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue…”
Journal Article
10. Measuring Faithfulness in Chain-of-Thought Reasoning
Published 16-07-2023
“…Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear…”
Journal Article
11. Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
Published 16-07-2023
“…As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help…”
Journal Article
12. Risks from Learned Optimization in Advanced Machine Learning Systems
Published 05-06-2019
“…We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer - a situation we refer to as…”
Journal Article
13. Discovering Language Model Behaviors with Model-Written Evaluations
Published 19-12-2022
“…As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates…”
Journal Article