Opening the Black Box of wav2vec Feature Encoder
Format: Journal Article
Language: English
Published: 27-10-2022
Summary: Self-supervised models, namely wav2vec and its variants, have shown
promising results in various downstream tasks in the speech domain. However,
their inner workings are poorly understood, calling for in-depth analyses of
what the model learns. In this paper, we concentrate on the convolutional
feature encoder, whose latent space is often speculated to represent discrete
acoustic units. To analyze the embedding space in a reductive manner, we feed
the encoder synthesized audio signals, each a summation of simple sine waves.
Through extensive experiments, we conclude that various kinds of information
are embedded in the feature encoder representations: (1) fundamental frequency,
(2) formants, and (3) amplitude, packed with (4) sufficient temporal detail.
Further, the information incorporated in the latent representations is
analogous to spectrograms but with a fundamental difference: latent
representations construct a metric space, so that closer representations imply
acoustic similarity.
DOI: 10.48550/arxiv.2210.15386
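The probe signals the summary describes, audio synthesized as a summation of simple sine waves, can be sketched as below. This is a minimal illustration, not the paper's actual setup; the specific frequency and amplitude values (a 120 Hz fundamental plus two formant-like components) are assumptions chosen for the example.

```python
import numpy as np

def synthesize_probe(freqs_hz, amps, sr=16000, dur_s=1.0):
    """Return a sum of sine waves: the kind of reductive probe signal
    the paper feeds to the wav2vec feature encoder. Each entry of
    `freqs_hz` is paired with the corresponding amplitude in `amps`."""
    t = np.arange(int(sr * dur_s)) / sr  # sample times in seconds
    signal = sum(a * np.sin(2 * np.pi * f * t)
                 for f, a in zip(freqs_hz, amps))
    return signal.astype(np.float32)

# Illustrative values (not from the paper): a fundamental at 120 Hz
# plus formant-like components at 700 Hz and 1200 Hz.
probe = synthesize_probe([120.0, 700.0, 1200.0], [1.0, 0.5, 0.25])
```

The resulting array can then be passed to a feature encoder (e.g. a pretrained wav2vec model expecting 16 kHz mono input) to inspect how the latent representations vary with fundamental frequency, formants, and amplitude.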