Hierarchical Character Embeddings: Learning Phonological and Semantic Representations in Languages of Logographic Origin Using Recursive Neural Networks

Logographs (Chinese characters) have recursive structures (i.e. hierarchies of sub-units in logographs) that contain phonological and semantic information, as developmental psychology literature suggests that native speakers leverage on the structures to learn how to read. Exploiting these structure...

Full description

Saved in:

Bibliographic Details
Published in:	IEEE/ACM transactions on audio, speech, and language processing Vol. 28; pp. 461 - 473
Main Authors:	Nguyen, Minh, Ngo, Gia H., Chen, Nancy F.
Format:	Journal Article
Language:	English
Published:	Piscataway IEEE 2020 The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Binary trees Cantonese Data models Diagnostic systems embeddings Empirical analysis Hierarchies Human behavior Language modeling Learning Logograms logograph Mapping Modelling morphology Neural networks Phonology Predictive models Pronunciation Psychology Reading acquisition Recursion Recursive structure Semantics Task analysis Training
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	Logographs (Chinese characters) have recursive structures (i.e. hierarchies of sub-units in logographs) that contain phonological and semantic information, as developmental psychology literature suggests that native speakers leverage on the structures to learn how to read. Exploiting these structures could potentially lead to better embeddings that can benefit many downstream tasks. We propose building hierarchical logograph (character) embeddings from logograph recursive structures using treeLSTM, a recursive neural network. Using recursive neural network imposes a prior on the mapping from logographs to embeddings since the network must read in the sub-units in logographs according to the order specified by the recursive structures. Based on human behavior in language learning and reading, we hypothesize that modeling logographs' structures using recursive neural network should be beneficial. To verify this claim, we consider two tasks (1) predicting logographs' Cantonese pronunciation from logographic structures and (2) language modeling. Empirical results show that the proposed hierarchical embeddings outperform baseline approaches. Diagnostic analysis suggests that hierarchical embeddings constructed using treeLSTM is less sensitive to distractors, thus is more robust, especially on complex logographs.
ISSN:	2329-9290 2329-9304
DOI:	10.1109/TASLP.2019.2955246