Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
Main Authors: | , , , , , , , |
---|---|
Format: | Journal Article |
Language: | English |
Published: | 27-06-2024 |
Subjects: | |
Summary: | Yorùbá, an African language with roughly 47 million speakers, encompasses a
continuum with several dialects. Recent efforts to develop NLP technologies for
African languages have focused on their standard dialects, resulting in
disparities for dialects and varieties for which there are few or no
resources or tools. We take steps towards bridging this gap by introducing
YORÙLECT, a new high-quality parallel text and speech corpus spanning three
domains and four regional Yorùbá dialects. To develop this corpus, we
engaged native speakers, travelling to communities where these dialects are
spoken, to collect text and speech data. Using our newly created corpus, we
conducted extensive experiments on (text) machine translation, automatic speech
recognition, and speech-to-text translation. Our results reveal substantial
performance disparities between standard Yorùbá and the other dialects
across all tasks. However, we also show that dialect-adaptive finetuning
narrows this gap. We believe our dataset and experimental
analysis will contribute greatly to developing NLP tools for Yorùbá and its
dialects, and potentially for other African languages, by improving our
understanding of existing challenges and offering a high-quality dataset for
further development. We release the YORÙLECT dataset and models publicly under an
open license. |
---|---|
DOI: | 10.48550/arxiv.2406.19564 |