Search Results - "Panickssery, Nina"

  • Showing results 1 - 4 of 4
  1.

    Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct by Ackerman, Christopher, Panickssery, Nina

    Published 02-10-2024
    “…It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we…”
    Journal Article
  2.

    Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models by Ball, Sarah, Kreuter, Frauke, Panickssery, Nina

    Published 13-06-2024
    “…Conversational large language models are trained to refuse to answer harmful questions. However, emergent jailbreaking techniques can still elicit unsafe…”
    Journal Article
  3.

    Refusal in Language Models Is Mediated by a Single Direction by Arditi, Andy, Obeso, Oscar, Syed, Aaquib, Paleka, Daniel, Panickssery, Nina, Gurnee, Wes, Nanda, Neel

    Published 17-06-2024
    “…Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful…”
    Journal Article
  4.

    Steering Llama 2 via Contrastive Activation Addition by Panickssery, Nina, Gabrieli, Nick, Schulz, Julian, Tong, Meg, Hubinger, Evan, Turner, Alexander Matt

    Published 08-12-2023
    “…We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA…”
    Journal Article