Training trajectories, mini-batch losses and the curious role of the learning rate

Bibliographic Details
Main Authors: Sandler, Mark, Zhmoginov, Andrey, Vladymyrov, Max, Miller, Nolan
Format: Journal Article (preprint)
Language: English
Published: 2023-01-05
Subjects: Computer Science - Learning
DOI: 10.48550/arxiv.2301.02312
Online Access: https://arxiv.org/abs/2301.02312
Copyright: http://creativecommons.org/licenses/by/4.0
Abstract: Stochastic gradient descent plays a fundamental role in nearly all applications of deep learning. However, its ability to converge to a global minimum remains shrouded in mystery. In this paper we propose to study the behavior of the loss function on fixed mini-batches along SGD trajectories. We show that the loss function on a fixed batch appears to be remarkably convex-like. In particular, for ResNet the loss for any fixed mini-batch can be accurately modeled by a quadratic function, and a very low loss value can be reached in just one step of gradient descent with a sufficiently large learning rate. We propose a simple model that allows us to analyze the relationship between the gradients of stochastic mini-batches and the full batch. Our analysis allows us to discover an equivalence between iterate averaging and specific learning rate schedules. In particular, for Exponential Moving Average (EMA) and Stochastic Weight Averaging (SWA) we show that our proposed model matches the observed training trajectories on ImageNet. Our theoretical model predicts that an even simpler averaging technique, averaging just two points many steps apart, significantly improves accuracy compared to the baseline. We validate our findings on ImageNet and other datasets using the ResNet architecture.
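The abstract's two central observations lend themselves to a quick empirical check: recording the loss of a single fixed mini-batch along an SGD trajectory, and averaging just two checkpoints taken many steps apart. The sketch below illustrates both ideas in PyTorch on a synthetic classification task with a small MLP; the data, model, step counts, and learning rate are placeholder assumptions for illustration, not the ResNet/ImageNet setup studied in the paper.

```python
# Minimal sketch (assumed toy setup, not the paper's ResNet/ImageNet experiments):
# (1) record the loss of one fixed mini-batch along an SGD trajectory, and
# (2) average two checkpoints taken many steps apart ("two-point averaging").
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny synthetic classification problem stands in for ImageNet.
X = torch.randn(2048, 32)
y = torch.randint(0, 10, (2048,))
fixed_x, fixed_y = X[:128], y[:128]   # the fixed mini-batch we keep re-evaluating

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

fixed_batch_losses = []
checkpoints = {}

for step in range(1, 1001):
    idx = torch.randint(0, X.shape[0], (128,))   # random training mini-batch
    opt.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    opt.step()

    with torch.no_grad():                         # loss on the *fixed* batch
        fixed_batch_losses.append(loss_fn(model(fixed_x), fixed_y).item())

    if step in (500, 1000):                       # two checkpoints, many steps apart
        checkpoints[step] = copy.deepcopy(model.state_dict())

# Two-point averaging: average the parameters of the two checkpoints.
averaged = copy.deepcopy(model)
avg_state = {k: 0.5 * (checkpoints[500][k] + checkpoints[1000][k])
             for k in checkpoints[500]}
averaged.load_state_dict(avg_state)

with torch.no_grad():
    print("final fixed-batch loss:   ", fixed_batch_losses[-1])
    print("averaged-model full loss: ", loss_fn(averaged(X), y).item())
```

The recorded fixed-batch loss trace corresponds to the quantity the abstract describes as well modeled by a quadratic, and the checkpoint averaging mirrors the two-point scheme the authors report as a simple alternative to EMA and SWA.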