Exploiting Dialect Identification in Automatic Dialectal Text Normalization
Main Authors: | , , , , , , |
---|---|
Format: | Journal Article |
Language: | English |
Published: | 03-07-2024 |
Summary: | Dialectal Arabic is the primary spoken language used by native Arabic speakers in daily communication. The rise of social media platforms has notably expanded its use as a written language. However, Arabic dialects do not have standard orthographies. This, combined with the inherent noise in user-generated content on social media, presents a major challenge to NLP applications dealing with Dialectal Arabic. In this paper, we explore and report on the task of CODAfication, which aims to normalize Dialectal Arabic into the Conventional Orthography for Dialectal Arabic (CODA). We work with a unique parallel corpus of multiple Arabic dialects focusing on five major city dialects. We benchmark newly developed pretrained sequence-to-sequence models on the task of CODAfication. We further show that using dialect identification information improves the performance across all dialects. We make our code, data, and pretrained models publicly available. |
DOI: | 10.48550/arxiv.2407.03020 |