A Comparison Study of Word Embedding for Detecting Named Entities of Code-Mixed Data in Indian Language

Bibliographic Details
Published in: 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 2375-2381
Main Authors: Sravani, Lolla; Reddy, Atla Sowmya; Thara, S.
Format: Conference Proceeding
Language: English
Published: IEEE, 01-09-2018
Description
Summary: Communication has increased many-fold in the internet era, making social media a lively platform for the exchange of information. Most people use multiple or mixed languages in their conversations as they share contemporaneous information. Code-mixing is the practice of mixing two or more languages within a dialogue, and extracting relevant, meaningful information from such a mixture of languages is a tedious exercise. The objective of this paper is to perform named entity recognition (NER), one of the challenging tasks in natural language processing. The method proposed herein offers a novel and heretofore unaddressed exhaustive comparison of four word embedding approaches: the Continuous Bag of Words model (CBOW), the Skip-gram model, Term Frequency-Inverse Document Frequency (TF-IDF), and Global Vectors for Word Representation (GloVe). These word vector representation schemes capture the meaning of words along different dimensions, here for the code-mixed language pair English-Hindi. The resulting word (feature) vectors, computed from co-occurrences, yielded good cross-validation scores when paired with six conventional machine learning algorithms. The study reveals that TF-IDF is the best word embedding model, yielding the highest accuracy on this small dataset. Precision, recall, and F-measure were used as evaluation measures.
DOI: 10.1109/ICACCI.2018.8554918
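
The summary names the four embedding schemes and the cross-validated evaluation but gives no code; below is a minimal, hypothetical sketch (not from the paper) of how such a comparison could be wired up with gensim and scikit-learn. The toy English-Hindi sentences, the single LogisticRegression standing in for the paper's six classifiers, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of the comparison described in the summary: build CBOW,
# Skip-gram, and TF-IDF token features on a toy code-mixed English-Hindi corpus
# and cross-validate one conventional classifier with precision/recall/F1.
# GloVe, the fourth scheme, is usually loaded pretrained (e.g. via
# gensim.downloader.load("glove-wiki-gigaword-50")) and plugged in the same way.
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Toy tokenized code-mixed sentences with token-level NER tags
# (illustrative, not the paper's dataset).
sentences = [
    ["Delhi", "mein", "aaj", "barish", "hogi"],
    ["Virat", "ne", "match", "jeet", "liya"],
    ["Mumbai", "ki", "traffic", "bahut", "slow", "hai"],
    ["Priya", "kal", "office", "nahi", "aayegi"],
]
tags = [
    ["B-LOC", "O", "O", "O", "O"],
    ["B-PER", "O", "O", "O", "O"],
    ["B-LOC", "O", "O", "O", "O", "O"],
    ["B-PER", "O", "O", "O", "O"],
]

# CBOW (sg=0) and Skip-gram (sg=1) embeddings trained on the same corpus.
cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)
skip = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

def w2v_features(model):
    # One feature vector per token, read straight from the embedding table.
    return np.vstack([model.wv[w] for s in sentences for w in s])

# TF-IDF features: here each token simply inherits its sentence's TF-IDF row,
# a deliberately simple (and lossy) per-token representation.
tfidf = TfidfVectorizer(lowercase=False)
doc_rows = tfidf.fit_transform(" ".join(s) for s in sentences).toarray()
tfidf_X = np.vstack([doc_rows[i] for i, s in enumerate(sentences) for _ in s])

y = np.array([t for seq in tags for t in seq])
for name, X in [("CBOW", w2v_features(cbow)),
                ("Skip-gram", w2v_features(skip)),
                ("TF-IDF", tfidf_X)]:
    # One classifier stands in for the paper's six ML algorithms; the scoring
    # tuple mirrors the precision/recall/F-measure evaluation in the abstract.
    res = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=2,
                         scoring=("precision_macro", "recall_macro", "f1_macro"))
    print(name, {k: round(v.mean(), 3)
                 for k, v in res.items() if k.startswith("test_")})
```

Which token-level representation works best on code-mixed text is exactly the question the paper evaluates; on its small dataset the authors report TF-IDF coming out ahead.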