Sabotage Evaluations for Frontier Models
Main Authors: | , , , , , , , , , , , , , , , |
---|---|
Format: | Journal Article |
Language: | English |
Published: | 28-10-2024 |
Summary: | Sufficiently capable models could subvert human oversight and decision-making
in important contexts. For example, in the context of AI development, models
could covertly sabotage efforts to evaluate their own dangerous capabilities,
to monitor their behavior, or to make decisions about their deployment. We
refer to this family of abilities as sabotage capabilities. We develop a set of
related threat models and evaluations. These evaluations are designed to
provide evidence that a given model, operating under a given set of
mitigations, could not successfully sabotage a frontier model developer or
other large organization's activities in any of these ways. We demonstrate
these evaluations on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet models.
Our results suggest that for these models, minimal mitigations are currently
sufficient to address sabotage risks, but that more realistic evaluations and
stronger mitigations seem likely to be necessary soon as capabilities improve.
We also survey related evaluations we tried and abandoned. Finally, we discuss
the advantages of mitigation-aware capability evaluations, and of simulating
large-scale deployments using small-scale statistics. |
---|---|
DOI: | 10.48550/arxiv.2410.21514 |