Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Main Authors: Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Aaron Courville
Format: Journal Article (arXiv preprint)
Language: English
Published: 23-10-2024
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Learning
DOI: 10.48550/arXiv.2410.18252
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0)
Online Access: https://arxiv.org/abs/2410.18252
Abstract: The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
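The abstract describes decoupling sample generation from learning so that new samples are produced while the learner trains on older, slightly off-policy ones. The toy sketch below is illustrative only and not taken from the paper: the queue/threading structure, the policy_version counter, and the time.sleep stubs standing in for LLM generation, reward labelling, and the online DPO update are all assumptions made for this example.

```python
# Illustrative sketch of the generation/learning split behind asynchronous RLHF.
# Everything here is a stand-in: time.sleep replaces LLM generation, reward
# labelling, and the DPO gradient step; policy_version stands in for model
# weights shared between the generator and the learner.
import queue
import random
import threading
import time

sample_queue = queue.Queue(maxsize=4)   # buffer of reward-labelled batches
stop = threading.Event()
policy_version = [0]                    # mutable so both threads see updates

def generate_batch(version):
    """Stub: sample completions from a (possibly stale) policy and score them."""
    time.sleep(0.05)                    # stand-in for expensive generation + reward model
    return {"version": version, "rewards": [random.random() for _ in range(8)]}

def generator_loop():
    """Producer: keeps generating instead of waiting for each learner step."""
    while not stop.is_set():
        try:
            sample_queue.put(generate_batch(policy_version[0]), timeout=0.1)
        except queue.Full:
            pass                        # learner is behind; try again

def learner_loop(num_steps=20):
    """Consumer: trains on buffered data, which may be a few policy versions old."""
    for step in range(num_steps):
        batch = sample_queue.get()
        staleness = policy_version[0] - batch["version"]   # degree of off-policyness
        time.sleep(0.02)                                   # stand-in for a DPO gradient step
        policy_version[0] += 1                             # "publish" updated weights
        print(f"step {step}: data is {staleness} version(s) off-policy")
    stop.set()

threading.Thread(target=generator_loop, daemon=True).start()
learner_loop()
```

The printed staleness corresponds to how off-policy the training data is; per the abstract, online DPO tolerates this staleness best, and tolerance grows with the scale of the policy model.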