Search Results - "Panickssery, Nina"

1
Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct by Ackerman, Christopher, Panickssery, Nina

Published 02-10-2024
“…It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we…”

Journal Article

2
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models by Ball, Sarah, Kreuter, Frauke, Panickssery, Nina

Published 13-06-2024
“…Conversational large language models are trained to refuse to answer harmful questions. However, emergent jailbreaking techniques can still elicit unsafe…”

Journal Article

3
Refusal in Language Models Is Mediated by a Single Direction by Arditi, Andy, Obeso, Oscar, Syed, Aaquib, Paleka, Daniel, Panickssery, Nina, Gurnee, Wes, Nanda, Neel

Published 17-06-2024
“…Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful…”

Journal Article

4
Steering Llama 2 via Contrastive Activation Addition by Panickssery, Nina, Gabrieli, Nick, Schulz, Julian, Tong, Meg, Hubinger, Evan, Turner, Alexander Matt

Published 08-12-2023
“…We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA…”

Journal Article
