Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Bibliographic Details
Main Authors: Noukhovitch, Michael, Huang, Shengyi, Xhonneux, Sophie, Hosseini, Arian, Agarwal, Rishabh, Courville, Aaron
Format: Journal Article
Language: English
Published: 23-10-2024
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Learning
DOI: 10.48550/arXiv.2410.18252
Online Access: https://arxiv.org/abs/2410.18252
Copyright: http://creativecommons.org/licenses/by-nc-nd/4.0
Abstract The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
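The abstract's core idea, decoupling generation from learning so that the learner trains on batches produced by a slightly older version of the policy, can be illustrated with a small toy loop. The sketch below is not the authors' implementation: model generation, reward labelling, and the online DPO update are replaced by placeholder functions, and a bounded queue stands in for whatever mechanism limits how stale (off-policy) the training batches may get.

```python
# Minimal sketch of asynchronous RLHF as described in the abstract: a generator
# thread keeps sampling with whatever policy weights it currently holds, while
# the learner trains on batches that may come from a previous policy version.
# All model, reward, and optimizer calls are placeholders (assumptions).
import queue
import threading

NUM_ITERATIONS = 5
BATCHES_PER_ITERATION = 4

def generate_batch(policy_version: int) -> dict:
    """Placeholder: sample completions from the policy and label them with a
    reward model. In practice this is the expensive generation step."""
    return {"policy_version": policy_version, "samples": f"batch from v{policy_version}"}

def train_on_batch(batch: dict, learner_version: int) -> None:
    """Placeholder: one learning update (e.g. an online DPO-style step).
    The batch may be off-policy, i.e. generated by an older policy version."""
    lag = learner_version - batch["policy_version"]
    print(f"learner v{learner_version} training on {batch['samples']} (off-policy lag = {lag})")

def generator(batch_queue: queue.Queue, shared_version: list) -> None:
    # Produces samples continuously instead of waiting for each learner update.
    for _ in range(NUM_ITERATIONS * BATCHES_PER_ITERATION):
        batch_queue.put(generate_batch(shared_version[0]))
    batch_queue.put(None)  # sentinel: no more batches

def learner(batch_queue: queue.Queue, shared_version: list) -> None:
    steps = 0
    while True:
        batch = batch_queue.get()
        if batch is None:
            break
        train_on_batch(batch, shared_version[0])
        steps += 1
        if steps % BATCHES_PER_ITERATION == 0:
            shared_version[0] += 1  # "publish" updated weights to the generator

if __name__ == "__main__":
    batches: queue.Queue = queue.Queue(maxsize=2)  # bounded queue caps staleness
    version = [0]  # shared policy version, standing in for shared weights
    threads = [
        threading.Thread(target=generator, args=(batches, version)),
        threading.Thread(target=learner, args=(batches, version)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In this toy, the queue's `maxsize` plays the role of the off-policyness budget the abstract asks about: a larger queue lets generation run further ahead of learning (more speedup, more staleness), while `maxsize=0`-style strict alternation would recover synchronous, on-policy training.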