Social Contract AI: Aligning AI Assistants with Implicit Group Norms
Main Authors:
Format: Journal Article
Language: English
Published: 26-10-2023
Subjects:
Online Access: Get full text
Summary: We explore the idea of aligning an AI assistant by inverting a model of users' (unknown) preferences from observed interactions. To validate our proposal, we run proof-of-concept simulations in the economic ultimatum game, formalizing user preferences as policies that guide the actions of simulated players. We find that the AI assistant accurately aligns its behavior to match standard policies from the economic literature (e.g., selfish, altruistic). However, the assistant's learned policies lack robustness and exhibit limited generalization in an out-of-distribution setting when confronted with a currency (e.g., grams of medicine) that was not included in the assistant's training distribution. Additionally, we find that when there is inconsistency in the relationship between language use and an unknown policy (e.g., an altruistic policy combined with rude language), the assistant's learning of the policy is slowed. Overall, our preliminary results suggest that developing simulation frameworks in which AI assistants need to infer preferences from diverse users can provide a valuable approach for studying practical alignment questions.
DOI: 10.48550/arxiv.2310.17769
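
The setup described in the summary lends itself to a small illustration. The sketch below is not the authors' code: it assumes a hypothetical two-policy responder (selfish vs. altruistic), a simple acceptance-threshold rule, and a Bayesian update over observed accept/reject decisions, purely to show how an unknown policy might be inferred from ultimatum-game interactions.

```python
# Illustrative sketch only: inferring a simulated user's unknown policy in the
# ultimatum game from observed accept/reject decisions. Policy names,
# thresholds, and the noise model are assumptions, not the paper's setup.
import random

# Candidate responder policies: accept an offer if the proposer keeps at most
# this fraction of the pot. A "selfish" responder tolerates greedy splits;
# an "altruistic" (fairness-minded) one does not.
POLICIES = {
    "selfish": 0.9,      # accepts almost any positive offer
    "altruistic": 0.5,   # insists on a roughly fair split
}

def responds_accept(policy_name, keep_fraction, noise=0.05):
    """Simulated user decision: accept unless the proposer keeps too much."""
    accept = keep_fraction <= POLICIES[policy_name]
    # Small chance of a noisy (inconsistent) decision.
    return accept if random.random() > noise else not accept

def likelihood(policy_name, keep_fraction, accepted, noise=0.05):
    """P(observed decision | policy), matching the noisy rule above."""
    consistent = (keep_fraction <= POLICIES[policy_name]) == accepted
    return (1 - noise) if consistent else noise

def infer_policy(observations):
    """Bayesian posterior over policies given (keep_fraction, accepted) pairs."""
    posterior = {name: 1.0 / len(POLICIES) for name in POLICIES}
    for keep_fraction, accepted in observations:
        for name in posterior:
            posterior[name] *= likelihood(name, keep_fraction, accepted)
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

if __name__ == "__main__":
    random.seed(0)
    true_policy = "altruistic"
    # The assistant observes the user's responses to randomly proposed splits.
    observations = []
    for _ in range(20):
        keep_fraction = random.uniform(0.3, 0.95)
        observations.append((keep_fraction, responds_accept(true_policy, keep_fraction)))
    print(infer_policy(observations))  # posterior should favor "altruistic"
```

Under these assumptions, offers where the proposer keeps between 50% and 90% of the pot are the informative ones, so a handful of observations is enough for the posterior to concentrate on the true policy; the noise term loosely mirrors the paper's observation that inconsistent signals slow policy learning.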