SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations

Bibliographic Details
Main Authors: Meghanani, Amit, Hain, Thomas
Format: Journal Article
Language: English
Published: 10-03-2024
Subjects: Computer Science - Computation and Language; Computer Science - Sound
DOI: 10.48550/arXiv.2403.06260
Online Access: https://arxiv.org/abs/2403.06260
License: http://creativecommons.org/licenses/by/4.0
Abstract: There is a growing interest in cost-effective self-supervised fine-tuning (SSFT) of self-supervised learning (SSL)-based speech models to obtain task-specific representations. These task-specific representations are used for robust performance on various downstream tasks by fine-tuning on the labelled data. This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt the SSL speech representations for content-related tasks. The proposed method uses a correspondence training strategy, aiming to learn similar representations from perturbed speech and original speech. Commonly used data augmentation techniques for content-related tasks (ASR) are applied to obtain perturbed speech. SCORE fine-tuned HuBERT outperforms vanilla HuBERT on the SUPERB benchmark with only a few hours of fine-tuning (< 5 hrs) on a single GPU for automatic speech recognition, phoneme recognition, and query-by-example tasks, with relative improvements of 1.09%, 3.58%, and 12.65%, respectively. SCORE provides competitive results with the recently proposed SSFT method SPIN, using only 1/3 of the processed speech compared to SPIN.
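The abstract only outlines the correspondence training strategy at a high level. As a rough illustration (not the authors' implementation), the idea of pulling together representations of original and perturbed speech can be sketched in PyTorch as below; the tiny convolutional encoder, the gain/noise perturbation, and the cosine-similarity loss are all assumptions standing in for a pretrained HuBERT encoder and the ASR-style augmentations mentioned above.

# Minimal sketch of correspondence-style fine-tuning (illustration only).
# The tiny Conv1d "encoder" stands in for a pretrained SSL model such as
# HuBERT; the gain/noise perturbation and the cosine loss are assumptions,
# not the exact recipe from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Stand-in for a pretrained SSL speech encoder (e.g., HuBERT)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5),
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4),
            nn.GELU(),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> frame-level features (batch, frames, dim)
        return self.net(wav.unsqueeze(1)).transpose(1, 2)


def perturb(wav: torch.Tensor) -> torch.Tensor:
    """Simple stand-in for ASR-style augmentation: random gain + additive noise."""
    gain = torch.empty(wav.size(0), 1).uniform_(0.7, 1.3)
    noise = 0.01 * torch.randn_like(wav)
    return gain * wav + noise


def correspondence_loss(h_orig: torch.Tensor, h_pert: torch.Tensor) -> torch.Tensor:
    """Pull frame-level representations of original and perturbed speech together."""
    return (1.0 - F.cosine_similarity(h_orig, h_pert, dim=-1)).mean()


if __name__ == "__main__":
    encoder = TinyEncoder()
    optim = torch.optim.Adam(encoder.parameters(), lr=1e-4)

    wav = torch.randn(4, 16000)  # a batch of 1-second utterances at 16 kHz
    loss = correspondence_loss(encoder(wav), encoder(perturb(wav)))
    loss.backward()
    optim.step()
    print(f"correspondence loss: {loss.item():.4f}")

In practice the encoder would be initialized from a pretrained checkpoint and fine-tuned only briefly, consistent with the few hours of single-GPU fine-tuning reported above.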