An Information-Theoretic Analysis of Self-supervised Discrete Representations of Speech
Format: Journal Article
Language: English
Published: 04-06-2023
Summary: Self-supervised representation learning for speech often involves a quantization step that transforms the acoustic input into discrete units. However, it remains unclear how to characterize the relationship between these discrete units and abstract phonetic categories such as phonemes. In this paper, we develop an information-theoretic framework whereby we represent each phonetic category as a distribution over discrete units. We then apply our framework to two different self-supervised models (namely wav2vec 2.0 and XLSR) and use American English speech as a case study. Our study demonstrates that the entropy of phonetic distributions reflects the variability of the underlying speech sounds, with phonetically similar sounds exhibiting similar distributions. While our study confirms the lack of direct, one-to-one correspondence, we find an intriguing, indirect relationship between phonetic categories and discrete units.
DOI: 10.48550/arxiv.2306.02405
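
For readers who want a concrete sense of the framework described in the summary, the following minimal Python sketch shows one way a phonetic category can be represented as a distribution over discrete units and scored by its Shannon entropy. It is an illustration under assumptions of our own, not the authors' implementation: the frame-level phoneme/unit alignments, the toy `pairs` data, and the helper names `phoneme_unit_distribution` and `entropy_bits` are all introduced here for demonstration.

```python
import numpy as np

def phoneme_unit_distribution(pairs, num_units):
    """Estimate P(unit | phoneme) from aligned (phoneme, unit) pairs."""
    counts = {}
    for phoneme, unit in pairs:
        dist = counts.setdefault(phoneme, np.zeros(num_units))
        dist[unit] += 1
    # Normalize raw counts into probability distributions.
    return {p: c / c.sum() for p, c in counts.items()}

def entropy_bits(dist):
    """Shannon entropy in bits, treating 0 * log2(0) as 0."""
    nonzero = dist[dist > 0]
    return float(-(nonzero * np.log2(nonzero)).sum())

# Hypothetical frame-level alignments between phoneme labels and
# quantized unit IDs (toy data, not from the paper).
pairs = [("AE", 3), ("AE", 3), ("AE", 7), ("S", 1), ("S", 1), ("S", 1)]
for phoneme, dist in phoneme_unit_distribution(pairs, num_units=8).items():
    # Higher entropy means the phoneme spreads over more units, i.e.,
    # greater variability in how the model discretizes that speech sound.
    print(phoneme, round(entropy_bits(dist), 3))
```

In a real experiment, the unit IDs would presumably come from the quantizer codebook of a model such as wav2vec 2.0, and the phoneme labels from a forced alignment of the speech corpus.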