Evaluating and Mitigating Discrimination in Language Model Decisions
Format: Journal Article
Language: English
Published: 06-12-2023
Summary: As language models (LMs) advance, interest is growing in applying them to high-stakes societal decisions, such as determining financing or housing eligibility. However, their potential for discrimination in such contexts raises ethical concerns, motivating the need for better methods to evaluate these risks. We present a method for proactively evaluating the potential discriminatory impact of LMs in a wide range of use cases, including hypothetical use cases where they have not yet been deployed. Specifically, we use an LM to generate a wide array of potential prompts that decision-makers may input into an LM, spanning 70 diverse decision scenarios across society, and systematically vary the demographic information in each prompt. Applying this methodology reveals patterns of both positive and negative discrimination in the Claude 2.0 model in select settings when no interventions are applied. While we do not endorse or permit the use of language models to make automated decisions for the high-risk use cases we study, we demonstrate techniques to significantly decrease both positive and negative discrimination through careful prompt engineering, providing pathways toward safer deployment in use cases where they may be appropriate. Our work enables developers and policymakers to anticipate, measure, and address discrimination as language model capabilities and applications continue to expand. We release our dataset and prompts at https://huggingface.co/datasets/Anthropic/discrim-eval
DOI: 10.48550/arxiv.2312.03689
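The summary describes a concrete evaluation loop: take many variants of the same decision prompt that differ only in demographic attributes, ask the model for a yes/no decision on each, and compare outcomes across groups. Below is a minimal sketch of that loop against the released dataset, assuming the Hugging Face `datasets` library. The config name `"explicit"`, the split name, and the column names (`filled_template`, `decision_question_id`, `age`, `gender`, `race`) are assumptions about the dataset's schema; `score_yes` and `MITIGATION_SUFFIX` are hypothetical stand-ins for a real model call and for the kind of prompt-engineering intervention the summary mentions, not the paper's own wording.

```python
# Minimal sketch of the evaluation loop: group prompt variants by decision
# scenario, score each variant, and report the largest gap in P(yes) across
# demographic groups. Schema names below are assumptions about the dataset.
from collections import defaultdict

from datasets import load_dataset

# Hypothetical paraphrase of a prompt-engineering intervention; the exact
# wording studied in the paper lives in the released prompts, not here.
MITIGATION_SUFFIX = (
    "\n\nPlease make this decision without taking the person's age, "
    "gender, or race into account."
)

def score_yes(prompt: str) -> float:
    """Placeholder for a real model call returning P('yes') for the prompt."""
    return 0.5  # replace with an actual LM query

ds = load_dataset("Anthropic/discrim-eval", "explicit", split="train")

# Group rows so every demographic variant of the same question is compared
# against the others.
by_question = defaultdict(list)
for row in ds:
    by_question[row["decision_question_id"]].append(row)

for qid, variants in by_question.items():
    scores = {
        (v["age"], v["gender"], v["race"]): score_yes(v["filled_template"])
        for v in variants
    }
    spread = max(scores.values()) - min(scores.values())
    print(f"question {qid}: max gap in P(yes) across groups = {spread:.3f}")
```

To test a mitigation of the kind the summary mentions, append `MITIGATION_SUFFIX` to each `filled_template` before scoring and re-run the same comparison; an effective intervention shrinks the per-question gaps toward zero.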