MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation
Format: Journal Article
Language: English
Published: 15-11-2022
Summary: There have been several meta-evaluation studies on the correlation
between human ratings and offline machine translation (MT) evaluation metrics
such as BLEU, chrF2, BERTScore, and COMET. These metrics have been used to
evaluate simultaneous speech translation (SST), but their correlations with
human ratings of SST, which have recently been collected as Continuous Ratings
(CR), are unclear. In this paper, we leverage the evaluations of candidate
systems submitted to the English-German SST task at IWSLT 2022 and conduct an
extensive correlation analysis of CR and the aforementioned metrics. Our study
reveals that the offline metrics correlate well with CR and can be reliably
used for evaluating machine translation in simultaneous mode, with some
limitations on the test set size. We conclude that, given the current quality
levels of SST, these metrics can be used as proxies for CR, alleviating the
need for large-scale human evaluation. Additionally, we observe that the
metrics' correlations with translation as a reference are significantly higher
than with simultaneous interpreting, and we therefore recommend the former for
reliable evaluation.
DOI: 10.48550/arxiv.2211.08633
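The analysis described in the summary pairs system-level metric scores with
mean Continuous Ratings. Below is a minimal sketch of such a system-level
correlation computation using SciPy's Pearson and Spearman implementations;
the system names and all score values are hypothetical placeholders, not the
paper's data or method.

```python
# Sketch: correlate system-level MT metric scores with mean Continuous
# Ratings (CR). All values below are hypothetical, for illustration only.
from scipy.stats import pearsonr, spearmanr

# One entry per candidate SST system: corpus-level metric scores and the
# mean CR assigned by human judges (hypothetical numbers).
systems = {
    "sys_a": {"bleu": 24.1, "comet": 0.78, "cr": 3.9},
    "sys_b": {"bleu": 21.7, "comet": 0.74, "cr": 3.5},
    "sys_c": {"bleu": 18.2, "comet": 0.69, "cr": 3.1},
    "sys_d": {"bleu": 26.0, "comet": 0.81, "cr": 4.2},
}

cr = [s["cr"] for s in systems.values()]
for metric in ("bleu", "comet"):
    scores = [s[metric] for s in systems.values()]
    r, r_p = pearsonr(scores, cr)       # linear correlation
    rho, rho_p = spearmanr(scores, cr)  # rank correlation
    print(f"{metric}: Pearson r={r:.3f} (p={r_p:.3f}), "
          f"Spearman rho={rho:.3f} (p={rho_p:.3f})")
```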