Search Results - "Panickssery, Nina"
1
Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
Published 02-10-2024
“…It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we…”
Journal Article
2
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
Published 13-06-2024
“…Conversational large language models are trained to refuse to answer harmful questions. However, emergent jailbreaking techniques can still elicit unsafe…”
Journal Article
3
Refusal in Language Models Is Mediated by a Single Direction
Published 17-06-2024
“…Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful…”
Journal Article
4
Steering Llama 2 via Contrastive Activation Addition
Published 08-12-2023
“…We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA…”
Journal Article