Augmenting Training Data for Low-Resource Neural Machine Translation via Bilingual Word Embeddings and BERT Language Modelling
Published in: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1-8
Main Authors:
Format: Conference Proceeding
Language: English
Published: IEEE, 18-07-2021
Summary: Neural machine translation (NMT) is often described as 'data hungry' because it typically requires large amounts of parallel data to build a good-quality machine translation (MT) system. However, most of the world's language pairs are low-resource or extremely low-resource. The situation is even worse when translation in a specialised domain is considered. In this paper, we present a novel data augmentation method which makes use of bilingual word embeddings (BWEs) learned from monolingual corpora and bidirectional encoder representations from transformers (BERT) language models (LMs). We augment a parallel training corpus by introducing new words (i.e. out-of-vocabulary (OOV) items) and by increasing the presence of rare words on both sides of the original parallel training corpus. Our experiments on simulated low-resource German-English and French-English translation tasks show that the proposed data augmentation strategy significantly improves state-of-the-art NMT systems and outperforms the state-of-the-art data augmentation approach for low-resource NMT.
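To make the idea in the summary concrete, here is a minimal Python sketch of one way such an augmentation step could look: a BERT masked LM proposes in-context substitutes for a rare source-side word, and each substitute is mapped to the target side through a lexicon induced from bilingual word embeddings. This is an illustrative assumption, not the paper's exact pipeline; the model name, the `bwe_lexicon` dictionary, and the function names are all hypothetical stand-ins.

```python
# Sketch of the augmentation idea: mask a rare source word, let a BERT
# masked LM rank in-context replacements, then substitute on both sides
# of the parallel pair using a BWE-induced lexicon. Model choice and the
# toy lexicon below are assumptions for illustration only.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "bert-base-german-cased"  # assumption: a German BERT for the source side
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)

def propose_substitutes(src_tokens, rare_idx, top_k=5):
    """Mask the rare word and let the LM rank in-context replacements."""
    masked = list(src_tokens)
    masked[rare_idx] = tokenizer.mask_token
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits
    top_ids = logits[0, mask_pos].topk(top_k).indices
    return [tokenizer.decode([int(i)]).strip() for i in top_ids]

# Toy stand-in for a German->English lexicon induced from BWEs (assumption;
# in the paper the BWEs are learned from monolingual corpora).
bwe_lexicon = {"Hund": "dog", "Katze": "cat", "Vogel": "bird"}

def augment_pair(src_tokens, tgt_tokens, rare_idx, tgt_idx):
    """Emit new parallel pairs by substituting the aligned word on both sides."""
    pairs = []
    for sub in propose_substitutes(src_tokens, rare_idx):
        tgt_sub = bwe_lexicon.get(sub)
        if tgt_sub is None:  # skip substitutes the lexicon cannot translate
            continue
        new_src = src_tokens[:rare_idx] + [sub] + src_tokens[rare_idx + 1:]
        new_tgt = tgt_tokens[:tgt_idx] + [tgt_sub] + tgt_tokens[tgt_idx + 1:]
        pairs.append((" ".join(new_src), " ".join(new_tgt)))
    return pairs
```

Each synthetic pair keeps the original sentence context and changes only the aligned word on both sides, which is what lets the method introduce OOV items and boost rare-word frequency without needing extra parallel data.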
ISSN: 2161-4407
DOI: 10.1109/IJCNN52387.2021.9534211