Augmenting Training Data for Low-Resource Neural Machine Translation via Bilingual Word Embeddings and BERT Language Modelling


Bibliographic Details
Published in: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1-8
Main Authors: Ramesh, Akshai; Uhana, Haque Usuf; Parthasarathy, Venkatesh Balavadhani; Haque, Rejwanul; Way, Andy
Format: Conference Proceeding
Language: English
Published: IEEE, 18-07-2021
Description
Summary: Neural machine translation (NMT) is often described as 'data hungry', as it typically requires large amounts of parallel data in order to build a good-quality machine translation (MT) system. However, most of the world's language pairs are low-resource or extremely low-resource. This situation becomes even worse when translation in a specialised domain is considered. In this paper, we present a novel data augmentation method which makes use of bilingual word embeddings (BWEs) learned from monolingual corpora and Bidirectional Encoder Representations from Transformers (BERT) language models (LMs). We augment a parallel training corpus by introducing new words (i.e. out-of-vocabulary (OOV) items) and by increasing the presence of rare words on both sides of the original parallel training corpus. Our experiments on simulated low-resource German-English and French-English translation tasks show that the proposed data augmentation strategy can significantly improve state-of-the-art NMT systems and outperform the state-of-the-art data augmentation approach for low-resource NMT.
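To make the idea concrete, below is a minimal, hypothetical Python sketch of the kind of augmentation loop the abstract describes: a BERT masked LM proposes in-context substitutes for a rare target-side word, and a bilingual word-embedding lookup maps each accepted substitute back to a source-side word, yielding a new synthetic sentence pair. The model name, the toy bwe_lexicon dictionary, and the hard-coded alignment indices are illustrative assumptions, not the paper's actual components; the paper learns BWEs from monolingual corpora rather than using a fixed lexicon.

```python
# Hypothetical sketch: augment a sentence pair by substituting a rare
# target-side word with masked-LM suggestions, then back-mapping each
# accepted suggestion to the aligned source word via a toy BWE lexicon.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def propose_substitutes(tokens, rare_idx, top_k=5):
    """Mask the rare word and let the masked LM suggest in-context replacements."""
    masked = tokens[:rare_idx] + [tokenizer.mask_token] + tokens[rare_idx + 1:]
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
    with torch.no_grad():
        logits = model(**inputs).logits
    top_ids = logits[0, mask_pos].topk(top_k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)

# Toy stand-in for a nearest-neighbour search in bilingual embedding space.
bwe_lexicon = {"home": "Zuhause", "apartment": "Wohnung", "car": "Auto"}

src, src_idx = "er verkaufte sein Haus".split(), 3   # aligned German word
tgt, tgt_idx = "he sold his house".split(), 3        # pretend "house" is rare

for cand in propose_substitutes(tgt, tgt_idx):
    if cand in bwe_lexicon:  # keep only candidates we can map to the source side
        new_tgt = tgt[:tgt_idx] + [cand] + tgt[tgt_idx + 1:]
        new_src = src[:src_idx] + [bwe_lexicon[cand]] + src[src_idx + 1:]
        print(" ".join(new_src), "|||", " ".join(new_tgt))
```

Per the abstract, the full method also injects genuinely new (OOV) words and operates on both sides of the corpus; this sketch only illustrates the rare-word substitution step.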
ISSN: 2161-4407
DOI: 10.1109/IJCNN52387.2021.9534211