Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Bibliographic Details
Main Authors: Noukhovitch, Michael, Huang, Shengyi, Xhonneux, Sophie, Hosseini, Arian, Agarwal, Rishabh, Courville, Aaron
Format: Journal Article
Language: English
Published: 23-10-2024
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Learning
DOI: 10.48550/arXiv.2410.18252
Online Access: https://arxiv.org/abs/2410.18252
Copyright: http://creativecommons.org/licenses/by-nc-nd/4.0
Abstract The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
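The abstract's core idea, decoupling generation from learning so that the learner trains on batches produced by a slightly older version of the policy, can be illustrated with a small toy loop. The sketch below is not the authors' implementation: model generation, reward labelling, and the online DPO update are replaced by placeholder functions, and a bounded queue stands in for whatever mechanism limits how stale (off-policy) the training batches may get.

```python
# Minimal sketch of asynchronous RLHF as described in the abstract: a generator
# thread keeps sampling with whatever policy weights it currently holds, while
# the learner trains on batches that may come from a previous policy version.
# All model, reward, and optimizer calls are placeholders (assumptions).
import queue
import threading

NUM_ITERATIONS = 5
BATCHES_PER_ITERATION = 4

def generate_batch(policy_version: int) -> dict:
    """Placeholder: sample completions from the policy and label them with a
    reward model. In practice this is the expensive generation step."""
    return {"policy_version": policy_version, "samples": f"batch from v{policy_version}"}

def train_on_batch(batch: dict, learner_version: int) -> None:
    """Placeholder: one learning update (e.g. an online DPO-style step).
    The batch may be off-policy, i.e. generated by an older policy version."""
    lag = learner_version - batch["policy_version"]
    print(f"learner v{learner_version} training on {batch['samples']} (off-policy lag = {lag})")

def generator(batch_queue: queue.Queue, shared_version: list) -> None:
    # Produces samples continuously instead of waiting for each learner update.
    for _ in range(NUM_ITERATIONS * BATCHES_PER_ITERATION):
        batch_queue.put(generate_batch(shared_version[0]))
    batch_queue.put(None)  # sentinel: no more batches

def learner(batch_queue: queue.Queue, shared_version: list) -> None:
    steps = 0
    while True:
        batch = batch_queue.get()
        if batch is None:
            break
        train_on_batch(batch, shared_version[0])
        steps += 1
        if steps % BATCHES_PER_ITERATION == 0:
            shared_version[0] += 1  # "publish" updated weights to the generator

if __name__ == "__main__":
    batches: queue.Queue = queue.Queue(maxsize=2)  # bounded queue caps staleness
    version = [0]  # shared policy version, standing in for shared weights
    threads = [
        threading.Thread(target=generator, args=(batches, version)),
        threading.Thread(target=learner, args=(batches, version)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In this toy, the queue's `maxsize` plays the role of the off-policyness budget the abstract asks about: a larger queue lets generation run further ahead of learning (more speedup, more staleness), while `maxsize=0`-style strict alternation would recover synchronous, on-policy training.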