CebuaNER: A New Baseline Cebuano Named Entity Recognition Model
Main Authors: | , , , , , , , , |
---|---|
Format: | Journal Article |
Language: | English |
Published: | 01-10-2023 |
Summary: | Despite being one of the most linguistically diverse groups of countries,
computational linguistics and language processing research in Southeast Asia
has struggled to match the level of countries from the Global North. Thus,
initiatives such as open-sourcing corpora and the development of baseline
models for basic language processing tasks are important stepping stones to
encourage the growth of research efforts in the field. To answer this call, we
introduce CebuaNER, a new baseline model for named entity recognition (NER) in
the Cebuano language. Cebuano is the second most-used native language in the
Philippines, with over 20 million speakers. To build the model, we collected
and annotated over 4,000 news articles, the largest collection of any work in
the language, retrieved from local Cebuano online platforms, and used them to
train algorithms such as Conditional Random Field and Bidirectional LSTM. Our
findings show promising results for a new baseline model, which achieves over
70% precision, recall, and F1 across all entity tags and shows potential
efficacy in a cross-lingual setup with Tagalog. |
---|---|
DOI: | 10.48550/arxiv.2310.00679 |