A Comparison Study of Word Embedding for Detecting Named Entities of Code-Mixed Data in Indian Language
Published in: | 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI) pp. 2375 - 2381 |
---|---|
Main Authors: | , , |
Format: | Conference Proceeding |
Language: | English |
Published: | IEEE 01-09-2018 |
Subjects: | |
Summary: | Communication has increased many-fold in the internet era, making social media a lively platform for the exchange of information. Most people use multiple or mixed languages in their conversations as they share contemporaneous information. Code mixing is the practice of mixing two or more languages within a dialogue. Extracting relevant and meaningful information from a mixed set of languages is a tedious exercise. The objective of the paper is to perform named entity recognition (NER), one of the challenging tasks in the domain of natural language processing. The proposed method presents a novel, exhaustive comparison study, heretofore unaddressed, of four word embedding approaches: the Continuous Bag of Words (CBOW) model, the Skip-gram model, Term Frequency-Inverse Document Frequency (TF-IDF), and Global Vectors for Word Representation (GloVe). These word vector representation schemes capture the meaning of words in different dimensions, here for the code-mixed language pair English-Hindi. The resulting word vectors, or feature vectors, computed from co-occurrences, yielded good cross-validation scores when compared across six conventional machine learning algorithms. The study finds that TF-IDF is the best word embedding model, yielding the highest accuracy on the small dataset. Precision, recall, and F-measure were used as evaluation measures. |
---|---|
DOI: | 10.1109/ICACCI.2018.8554918 |