Training trajectories, mini-batch losses and the curious role of the learning rate

Bibliographic Details
Main Authors: Sandler, Mark, Zhmoginov, Andrey, Vladymyrov, Max, Miller, Nolan
Format: Journal Article (preprint)
Language: English
Published: 2023-01-05
Subjects: Computer Science - Learning
DOI: 10.48550/arxiv.2301.02312
Online Access: https://arxiv.org/abs/2301.02312
Copyright: http://creativecommons.org/licenses/by/4.0
Abstract: Stochastic gradient descent plays a fundamental role in nearly all applications of deep learning. However, its ability to converge to a global minimum remains shrouded in mystery. In this paper we propose to study the behavior of the loss function on fixed mini-batches along SGD trajectories. We show that the loss function on a fixed batch appears to be remarkably convex-like. In particular, for ResNet the loss for any fixed mini-batch can be accurately modeled by a quadratic function, and a very low loss value can be reached in just one step of gradient descent with a sufficiently large learning rate. We propose a simple model that allows us to analyze the relationship between the gradients of stochastic mini-batches and the full batch. Our analysis allows us to discover an equivalence between iterate averaging and specific learning rate schedules. In particular, for Exponential Moving Average (EMA) and Stochastic Weight Averaging (SWA) we show that our proposed model matches the observed training trajectories on ImageNet. Our theoretical model predicts that an even simpler averaging technique, averaging just two points many steps apart, significantly improves accuracy compared to the baseline. We validate our findings on ImageNet and other datasets using the ResNet architecture.
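The abstract's two central observations lend themselves to a quick empirical check: recording the loss of a single fixed mini-batch along an SGD trajectory, and averaging just two checkpoints taken many steps apart. The sketch below illustrates both ideas in PyTorch on a synthetic classification task with a small MLP; the data, model, step counts, and learning rate are placeholder assumptions for illustration, not the ResNet/ImageNet setup studied in the paper.

```python
# Minimal sketch (assumed toy setup, not the paper's ResNet/ImageNet experiments):
# (1) record the loss of one fixed mini-batch along an SGD trajectory, and
# (2) average two checkpoints taken many steps apart ("two-point averaging").
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny synthetic classification problem stands in for ImageNet.
X = torch.randn(2048, 32)
y = torch.randint(0, 10, (2048,))
fixed_x, fixed_y = X[:128], y[:128]   # the fixed mini-batch we keep re-evaluating

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

fixed_batch_losses = []
checkpoints = {}

for step in range(1, 1001):
    idx = torch.randint(0, X.shape[0], (128,))   # random training mini-batch
    opt.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    opt.step()

    with torch.no_grad():                         # loss on the *fixed* batch
        fixed_batch_losses.append(loss_fn(model(fixed_x), fixed_y).item())

    if step in (500, 1000):                       # two checkpoints, many steps apart
        checkpoints[step] = copy.deepcopy(model.state_dict())

# Two-point averaging: average the parameters of the two checkpoints.
averaged = copy.deepcopy(model)
avg_state = {k: 0.5 * (checkpoints[500][k] + checkpoints[1000][k])
             for k in checkpoints[500]}
averaged.load_state_dict(avg_state)

with torch.no_grad():
    print("final fixed-batch loss:   ", fixed_batch_losses[-1])
    print("averaged-model full loss: ", loss_fn(averaged(X), y).item())
```

The recorded fixed-batch loss trace corresponds to the quantity the abstract describes as well modeled by a quadratic, and the checkpoint averaging mirrors the two-point scheme the authors report as a simple alternative to EMA and SWA.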