Training trajectories, mini-batch losses and the curious role of the learning rate
Stochastic gradient descent plays a fundamental role in nearly all applications of deep learning. However, its ability to converge to a global minimum remains shrouded in mystery. In this paper we propose to study the behavior of the loss function on fixed mini-batches along SGD trajectories. We show...
Main Authors: | Sandler, Mark; Zhmoginov, Andrey; Vladymyrov, Max; Miller, Nolan |
Format: | Journal Article |
Language: | English |
Published: | 2023-01-05 |
Subjects: | Computer Science - Learning |
Online Access: | Get full text: https://arxiv.org/abs/2301.02312 |
Abstract | Stochastic gradient descent plays a fundamental role in nearly all applications of deep learning. However, its ability to converge to a global minimum remains shrouded in mystery. In this paper we propose to study the behavior of the loss function on fixed mini-batches along SGD trajectories. We show that the loss function on a fixed batch appears to be remarkably convex-like. In particular, for ResNet the loss for any fixed mini-batch can be accurately modeled by a quadratic function, and a very low loss value can be reached in just one step of gradient descent with a sufficiently large learning rate. We propose a simple model that allows us to analyze the relationship between the gradients of stochastic mini-batches and the full batch. Our analysis allows us to discover the equivalence between iterate averaging and specific learning rate schedules. In particular, for Exponential Moving Average (EMA) and Stochastic Weight Averaging (SWA) we show that our proposed model matches the observed training trajectories on ImageNet. Our theoretical model predicts that an even simpler averaging technique, averaging just two points many steps apart, significantly improves accuracy compared to the baseline. We validated our findings on ImageNet and other datasets using the ResNet architecture. |
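Note (illustrative aside, not part of the bibliographic record): under the abstract's quadratic picture, the loss on a fixed mini-batch B behaves roughly like L_B(w) ≈ L_B(w_B*) + (1/2)(w - w_B*)^T H_B (w - w_B*), which is why a single gradient step with a large enough learning rate can land close to that batch's minimizer w_B*. The snippet below is a minimal, hypothetical Python/NumPy sketch of the weight-averaging schemes the abstract mentions: an exponential moving average of the weights kept alongside SGD, and a simple average of two checkpoints taken many steps apart. All function names, the decay value, and the learning rate are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch only (assumed names and hyperparameters, not the paper's code).
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

def two_point_average(early_params, late_params):
    """Average of just two checkpoints taken many steps apart."""
    return [0.5 * a + 0.5 * b for a, b in zip(early_params, late_params)]

# Toy training loop with stand-in gradients, showing where each average is taken.
rng = np.random.default_rng(0)
params = [np.zeros(10)]                    # stand-in for model weights
ema_params = [p.copy() for p in params]
checkpoint_early = None
for step in range(1000):
    grads = [rng.normal(size=10)]          # stand-in for a mini-batch gradient
    params = [p - 0.1 * g for p, g in zip(params, grads)]
    ema_params = ema_update(ema_params, params)        # EMA tracked every step
    if step == 100:
        checkpoint_early = [p.copy() for p in params]  # first of the two points

two_point = two_point_average(checkpoint_early, params)  # second point: final weights
```

In such a setup it is the averaged weights (EMA or the two-point average), rather than the raw final iterate, that would be evaluated.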
Author | Vladymyrov, Max; Miller, Nolan; Zhmoginov, Andrey; Sandler, Mark |
BackLink | https://doi.org/10.48550/arXiv.2301.02312 (View paper in arXiv) |
ContentType | Journal Article |
Copyright | http://creativecommons.org/licenses/by/4.0 |
DOI | 10.48550/arxiv.2301.02312 |
DatabaseName | arXiv Computer Science; arXiv.org |
ExternalDocumentID | 2301_02312 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
OpenAccessLink | https://arxiv.org/abs/2301.02312 |
PublicationDate | 2023-01-05 |
PublicationYear | 2023 |
SecondaryResourceType | preprint |
SourceID | arxiv |
SourceType | Open Access Repository |
SubjectTerms | Computer Science - Learning |
Title | Training trajectories, mini-batch losses and the curious role of the learning rate |
URI | https://arxiv.org/abs/2301.02312 |
linkProvider | Cornell University |