SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations
Main Authors: Meghanani, Amit; Hain, Thomas
Format: Journal Article
Language: English
Published: 10-03-2024
Subjects: Computer Science - Computation and Language; Computer Science - Sound
DOI: 10.48550/arxiv.2403.06260
Online Access: https://arxiv.org/abs/2403.06260
Abstract:

There is growing interest in cost-effective self-supervised fine-tuning (SSFT) of self-supervised learning (SSL)-based speech models to obtain task-specific representations. These task-specific representations yield robust performance on various downstream tasks when fine-tuned on labelled data. This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt SSL speech representations for content-related tasks. The proposed method uses a correspondence training strategy, aiming to learn similar representations from perturbed speech and original speech. Data augmentation techniques commonly used for content-related tasks (ASR) are applied to obtain the perturbed speech. SCORE fine-tuned HuBERT outperforms vanilla HuBERT on the SUPERB benchmark with only a few hours of fine-tuning (< 5 hrs) on a single GPU, with relative improvements of 1.09%, 3.58%, and 12.65% on automatic speech recognition, phoneme recognition, and query-by-example tasks, respectively. SCORE achieves results competitive with the recently proposed SSFT method SPIN while using only 1/3 of the processed speech compared to SPIN.
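The abstract describes the training strategy only at a high level: the SSL encoder sees both an original utterance and a perturbed (augmented) version, and a correspondence objective pulls the two frame-level representation sequences together. As an illustration only, here is a minimal PyTorch sketch of such a correspondence step; the choice of HuBERT via torchaudio, the additive-noise `perturb` function, and the frame-wise cosine loss are assumptions for the sketch, not the authors' actual implementation.

```python
# Minimal sketch of one correspondence fine-tuning step (illustrative only).
# Assumptions not taken from the paper: HuBERT loaded via torchaudio, a
# length-preserving additive-noise perturbation, and a frame-wise cosine loss.
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model()                     # pretrained HuBERT encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def perturb(wav: torch.Tensor) -> torch.Tensor:
    """Stand-in augmentation: additive Gaussian noise (preserves length)."""
    return wav + 0.005 * torch.randn_like(wav)

def correspondence_loss(z_orig: torch.Tensor, z_pert: torch.Tensor) -> torch.Tensor:
    """Pull frame-level representations of original and perturbed speech together."""
    cos = torch.nn.functional.cosine_similarity(z_orig, z_pert, dim=-1)
    return (1.0 - cos).mean()

wav = torch.randn(1, 16000)                    # stand-in for a 1 s, 16 kHz utterance

model.train()
z_orig, _ = model(wav)                         # (batch, frames, dim)
z_pert, _ = model(perturb(wav))                # same frame count for this augmentation
loss = correspondence_loss(z_orig, z_pert)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Note that a perturbation that changes the number of frames (e.g., speed perturbation) would additionally require a sequence-alignment step before any frame-wise comparison; the published method should be consulted for the exact augmentations and loss.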