An Automatic Language Identification System for Code-Mixed English-Kannada Social Media Text

Bibliographic Details
Published in:	2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS) pp. 1 - 5
Main Authors:	Sowmya Lakshmi, B S; Shambhavi, B R
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01-12-2017
Subjects:	Classification algorithms; Code-mixing; Cross Validation; Dictionaries; Facebook; Feature extraction; Hidden Markov models; Logistics; Machine Learning; Supervised classification

Description
Summary:	The task of automatically identifying the language of a document or word is known as Language Identification (LID). With the growing popularity of social media and smart devices, a huge number of people have come online. The majority of user-generated data on the web is in code-mixed or multi-script form, where words are written in a non-native script. In this work, we focus on the problem of word-level LID for code-mixed data. The collected dataset contains English-Kannada code-mixed sentences from social media posts. Experiments with various supervised classifiers are performed, with an embedded dictionary module to handle word-level code mixing.
DOI:	10.1109/CSITSS.2017.8447784
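
The summary describes word-level LID that combines a dictionary module with a supervised classifier. The record does not say how the two are combined, so the following is only a minimal sketch under stated assumptions: the dictionary is consulted first, and a character-bigram Naive Bayes model (a stand-in, not necessarily one of the classifiers used in the paper) handles tokens the dictionaries cannot resolve. The word lists, training pairs, and all names (`EN_DICT`, `KN_DICT`, `TRAIN`, `NgramNaiveBayes`, `tag_sentence`) are hypothetical toy values, not the paper's data.

```python
import math
from collections import Counter

# Toy, hypothetical word lists (NOT the paper's dictionary or dataset).
EN_DICT = {"the", "movie", "was", "good", "today", "very"}
KN_DICT = {"nanu", "illa", "beku", "oota", "thumba", "chennagide"}

# Toy labelled words used to train the fallback classifier.
TRAIN = [(w, "en") for w in sorted(EN_DICT)] + [(w, "kn") for w in sorted(KN_DICT)]


def char_ngrams(word, n=2):
    """Return character n-grams of a lowercased word padded with ^ and $."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Character n-gram Naive Bayes with add-one smoothing (assumed stand-in)."""

    def __init__(self, n=2):
        self.n = n
        self.gram_counts = {}     # label -> Counter of n-grams
        self.label_counts = Counter()
        self.vocab = set()

    def fit(self, pairs):
        for word, label in pairs:
            grams = char_ngrams(word, self.n)
            self.label_counts[label] += 1
            self.gram_counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, word):
        total_words = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.gram_counts.items():
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.label_counts[label] / total_words)
            denom = sum(counts.values()) + len(self.vocab)
            for gram in char_ngrams(word, self.n):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


MODEL = NgramNaiveBayes().fit(TRAIN)


def tag_sentence(sentence, model=MODEL):
    """Tag each whitespace-separated token as 'en' or 'kn'."""
    tags = []
    for token in sentence.split():
        word = token.lower()
        in_en, in_kn = word in EN_DICT, word in KN_DICT
        if in_en and not in_kn:
            label = "en"          # unambiguous dictionary hit
        elif in_kn and not in_en:
            label = "kn"
        else:
            label = model.predict(word)  # unknown or ambiguous token
        tags.append((token, label))
    return tags
```

Calling `tag_sentence("movie thumba chennagide")` resolves every token by dictionary lookup alone; a token found in neither list, or in both, is passed to the fallback classifier. Placing the lookup first is one plausible design, since dictionary hits are cheap and precise while the classifier covers the spelling variation typical of romanized text.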