SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations

Bibliographic Details
Main Authors: Meghanani, Amit, Hain, Thomas
Format: Journal Article
Language: English
Published: 10-03-2024
Subjects: Computer Science - Computation and Language; Computer Science - Sound
DOI: 10.48550/arXiv.2403.06260
Online Access: https://arxiv.org/abs/2403.06260
License: http://creativecommons.org/licenses/by/4.0
Abstract: There is a growing interest in cost-effective self-supervised fine-tuning (SSFT) of self-supervised learning (SSL)-based speech models to obtain task-specific representations. These task-specific representations are used for robust performance on various downstream tasks by fine-tuning on the labelled data. This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt the SSL speech representations for content-related tasks. The proposed method uses a correspondence training strategy, aiming to learn similar representations from perturbed speech and original speech. Commonly used data augmentation techniques for content-related tasks (ASR) are applied to obtain perturbed speech. SCORE fine-tuned HuBERT outperforms vanilla HuBERT on the SUPERB benchmark with only a few hours of fine-tuning (< 5 hrs) on a single GPU for automatic speech recognition, phoneme recognition, and query-by-example tasks, with relative improvements of 1.09%, 3.58%, and 12.65%, respectively. SCORE provides competitive results with the recently proposed SSFT method SPIN, using only 1/3 of the processed speech compared to SPIN.
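The abstract only outlines the correspondence training strategy at a high level. As a rough illustration (not the authors' implementation), the idea of pulling together representations of original and perturbed speech can be sketched in PyTorch as below; the tiny convolutional encoder, the gain/noise perturbation, and the cosine-similarity loss are all assumptions standing in for a pretrained HuBERT encoder and the ASR-style augmentations mentioned above.

# Minimal sketch of correspondence-style fine-tuning (illustration only).
# The tiny Conv1d "encoder" stands in for a pretrained SSL model such as
# HuBERT; the gain/noise perturbation and the cosine loss are assumptions,
# not the exact recipe from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Stand-in for a pretrained SSL speech encoder (e.g., HuBERT)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5),
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4),
            nn.GELU(),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> frame-level features (batch, frames, dim)
        return self.net(wav.unsqueeze(1)).transpose(1, 2)


def perturb(wav: torch.Tensor) -> torch.Tensor:
    """Simple stand-in for ASR-style augmentation: random gain + additive noise."""
    gain = torch.empty(wav.size(0), 1).uniform_(0.7, 1.3)
    noise = 0.01 * torch.randn_like(wav)
    return gain * wav + noise


def correspondence_loss(h_orig: torch.Tensor, h_pert: torch.Tensor) -> torch.Tensor:
    """Pull frame-level representations of original and perturbed speech together."""
    return (1.0 - F.cosine_similarity(h_orig, h_pert, dim=-1)).mean()


if __name__ == "__main__":
    encoder = TinyEncoder()
    optim = torch.optim.Adam(encoder.parameters(), lr=1e-4)

    wav = torch.randn(4, 16000)  # a batch of 1-second utterances at 16 kHz
    loss = correspondence_loss(encoder(wav), encoder(perturb(wav)))
    loss.backward()
    optim.step()
    print(f"correspondence loss: {loss.item():.4f}")

In practice the encoder would be initialized from a pretrained checkpoint and fine-tuned only briefly, consistent with the few hours of single-GPU fine-tuning reported above.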