Quantitative Fine-Grained Human Evaluation of Machine Translation Systems: a Case Study on English to Croatian
Main Authors: | , , |
---|---|
Format: | Journal Article |
Language: | English |
Published: | 02-02-2018 |
Subjects: | |
Summary: | Machine Translation, pp. 1-21 (2018), http://rdcu.be/GIkb
This paper presents a quantitative fine-grained manual evaluation approach to
comparing the performance of different machine translation (MT) systems. We
build upon the well-established Multidimensional Quality Metrics (MQM) error
taxonomy and implement a novel method that assesses whether the differences in
performance for MQM error types between different MT systems are statistically
significant. We conduct a case study for English-to-Croatian, a language
direction that involves translating into a morphologically rich language, for
which we compare three MT systems belonging to different paradigms: pure
phrase-based, factored phrase-based and neural. First, we designed an
MQM-compliant error taxonomy tailored to the relevant linguistic phenomena of
Slavic languages, which made the annotation process feasible and accurate.
Errors in MT outputs were then annotated by two annotators following this
taxonomy. Subsequently, we carried out a statistical analysis which showed that
the best-performing system (neural) reduces the errors produced by the worst
system (pure phrase-based) by more than half (54%). Moreover, we conducted an
additional analysis of agreement errors in which we distinguished between
short-distance (phrase-level) and long-distance (sentence-level) errors. We
discovered that phrase-based MT approaches are of limited use for long-distance agreement
phenomena, for which neural MT was found to be especially effective. |
---|---|
DOI: | 10.48550/arxiv.1802.01451 |
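
The summary reports a relative error reduction (54%) and a significance test over MQM error types, but this record does not say which test the authors use. The sketch below is therefore only an illustration under stated assumptions: it applies a Pearson chi-squared test (one degree of freedom, no continuity correction) to a 2x2 table of erroneous versus error-free tokens for two systems. The function names and all counts are hypothetical and are not taken from the paper.

```python
import math


def chi2_2x2(err_a, tok_a, err_b, tok_b):
    """Pearson chi-squared test on a 2x2 table (assumed test, not the paper's).

    err_x: tokens annotated with a given error type in system x's output
    tok_x: total tokens annotated for system x
    Assumes both column totals are non-zero, i.e. at least one error and
    at least one error-free token across the two systems.
    """
    table = [[err_a, tok_a - err_a], [err_b, tok_b - err_b]]
    total = tok_a + tok_b
    row_sums = [tok_a, tok_b]
    col_sums = [err_a + err_b, total - err_a - err_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of the chi-squared distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def relative_error_reduction(err_worst, err_best):
    """Fraction of the worst system's errors removed by the best system."""
    return (err_worst - err_best) / err_worst


# Hypothetical counts chosen only to illustrate a 54% reduction.
chi2, p = chi2_2x2(err_a=100, tok_a=1000, err_b=46, tok_b=1000)
reduction = relative_error_reduction(100, 46)
```

With real annotations, the same comparison would be repeated for each error type and each pair of systems, and running many such tests would normally call for a multiple-comparison correction.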