Ebaluatoia: crowd evaluation for English–Basque machine translation

Bibliographic Details
Published in: Language Resources and Evaluation Vol. 51; no. 4; pp. 1053-1084
Main Authors: Aranberri, Nora; Labaka, Gorka; de Ilarraza, Arantza Díaz; Sarasola, Kepa
Format: Journal Article
Language: English
Published: Dordrecht Springer 01-12-2017
Springer Netherlands
Springer Nature B.V
Description
Summary: This work explores the feasibility of a crowd-based pair-wise comparison evaluation to get feedback on machine translation progress for under-resourced languages. Specifically, we propose a task based on simple work units to compare the outputs of five English-to-Basque systems, which we implement in a web application. In our design, we put forward two key aspects that we believe community collaboration initiatives should consider in order to attract and maintain participants, that is, providing both a community challenge and a personal challenge. We describe how these aspects can comply with a strict methodology to ensure research validity. In particular, we consider the evaluation set size and the characteristics of the test sentences, the number of evaluators per comparison pair, and a mechanism to identify dishonest participation (or participants with insufficient linguistic knowledge). We also describe our dissemination effort, which targeted both general users and interest groups. Over 500 people participated actively in the Ebaluatoia campaign and we were able to collect over 35,000 evaluations in a short period of 10 days. From the results, we complete the ranking of the systems under evaluation and establish whether the difference in quality between the systems is significant.
ISSN: 1574-020X
1572-8412
1574-0218
DOI: 10.1007/s10579-016-9335-x
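
Illustrative note: the summary mentions ranking systems from pair-wise judgments and checking whether quality differences are significant. The record does not say how the authors did this, so the following is only a minimal sketch under stated assumptions: judgments are (system_a, system_b, winner) tuples with winner set to None for a tie, systems are ranked by their share of decided comparisons won, and significance for one pair is checked with a two-sided exact sign test. The function names and data layout are hypothetical and not taken from the article.

```python
from collections import Counter
from math import comb


def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test p-value for one system pair (ties excluded)."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def count_wins(judgments):
    """Count decided comparisons as {(winner, loser): n}; ties are skipped."""
    wins = Counter()
    for a, b, winner in judgments:
        if winner is None:
            continue  # assumption: a tie carries no preference information
        if winner not in (a, b):
            raise ValueError(f"winner {winner!r} is not one of {a!r}, {b!r}")
        loser = b if winner == a else a
        wins[(winner, loser)] += 1
    return wins


def rank_systems(judgments):
    """Order systems by the share of decided comparisons they won."""
    won, decided = Counter(), Counter()
    for (winner, loser), n in count_wins(judgments).items():
        won[winner] += n
        decided[winner] += n
        decided[loser] += n
    rate = {s: won[s] / decided[s] for s in decided}
    return sorted(rate, key=rate.get, reverse=True)
```

For example, with eight judgments preferring a hypothetical system "A" over "B" and two preferring "B", rank_systems places "A" first, and sign_test_p(8, 2) indicates whether that margin could plausibly arise by chance.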