Sanskrit Sandhi Splitting using seq2(seq)^2
Main Authors: | , , , , |
---|---|
Format: | Journal Article |
Language: | English |
Published: | 01-01-2018 |
Subjects: | |
Summary: | In Sanskrit, small words (morphemes) are combined to form compound words
through a process known as Sandhi. Sandhi splitting is the process of splitting
a given compound word into its constituent morphemes. Although rules governing
word splitting exist in the language, it is highly challenging to identify the
location of the splits in a compound word. Existing Sandhi splitting
systems incorporate these pre-defined splitting rules, but they have low accuracy
because the same compound word can be broken down in multiple ways that all yield
syntactically correct splits.
In this research, we propose a novel deep learning architecture called Double
Decoder RNN (DD-RNN), which (i) predicts the location of the split(s) with 95%
accuracy, and (ii) predicts the constituent words (learning the Sandhi
splitting rules) with 79.5% accuracy, outperforming the state of the art by 20%.
Additionally, we show the generalization capability of our deep learning model
by reporting competitive results on the problem of Chinese word segmentation. |
---|---|
DOI: | 10.48550/arxiv.1801.00428 |
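
The ambiguity described in the summary can be illustrated with a small sketch. It is not the DD-RNN from the paper. It is a toy rule inverter, assuming Harvard-Kyoto transliteration (capital letters for long vowels) and a hypothetical table, `INVERSE_RULES`, covering only three well-known vowel Sandhi merges. The function name `candidate_splits` is likewise an illustrative choice.

```python
# Toy inverse-Sandhi table (an assumption for illustration, far from complete):
# merged vowel -> possible (ending of left word, beginning of right word) pairs.
# Harvard-Kyoto transliteration: "A", "I", "U" are the long vowels.
INVERSE_RULES = {
    "A": [("a", "a"), ("a", "A"), ("A", "a"), ("A", "A")],
    "e": [("a", "i"), ("A", "i"), ("a", "I"), ("A", "I")],
    "o": [("a", "u"), ("A", "u"), ("a", "U"), ("A", "U")],
}


def candidate_splits(compound):
    """Undo one rule at one position; return all (position, left, right) triples."""
    candidates = []
    # Skip the first and last characters so neither side is a bare vowel.
    for pos in range(1, len(compound) - 1):
        for left_end, right_start in INVERSE_RULES.get(compound[pos], []):
            left = compound[:pos] + left_end
            right = right_start + compound[pos + 1:]
            candidates.append((pos, left, right))
    return candidates


if __name__ == "__main__":
    # "devendra" is the standard textbook compound of "deva" + "indra" (a + i -> e).
    for pos, left, right in candidate_splits("devendra"):
        print(pos, left, "+", right)
```

Even with three rules, the single compound `devendra` produces eight rule-consistent candidates at two positions, only one of which is the intended `deva` + `indra`. Choosing the split location and the constituent words among such candidates is the part the summary says the learned model addresses.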