An Automatic Language Identification System for Code-Mixed English-Kannada Social Media Text

Bibliographic Details
Published in: 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS), pp. 1-5
Main Authors: Sowmya Lakshmi, B S; Shambhavi, B R
Format: Conference Proceeding
Language: English
Published: IEEE 01-12-2017
Description
Summary: The task of automatically identifying the language of a document or word is known as Language Identification (LID). With the growing popularity of social media and smart devices, a huge number of people have come online. The majority of user-generated data on the web is in code-mixed or multi-script form, where words are written in a non-native script. In this work, we focus on the problem of word-level LID for code-mixed data. The collected dataset contains English-Kannada code-mixed sentences from social media posts. Experiments with various supervised classifiers are performed, with an embedded dictionary module to handle word-level code mixing.
DOI: 10.1109/CSITSS.2017.8447784
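
Illustration: the summary describes word-level LID that combines a dictionary module with supervised classifiers. The record does not give the authors' features, classifiers, or dictionary, so the following is only a minimal sketch of that general idea under stated assumptions: the word list `ENGLISH_WORDS`, the training pairs in `TRAIN`, the character-bigram Naive Bayes model, and the names `NgramNaiveBayes` and `tag_sentence` are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy resources; the paper's actual dictionary and dataset
# are not part of this record.
ENGLISH_WORDS = {"the", "movie", "was", "very", "good", "and", "is"}
TRAIN = [
    ("movie", "en"), ("good", "en"), ("super", "en"),
    ("nice", "en"), ("song", "en"), ("think", "en"),
    ("tumba", "kn"), ("chennagide", "kn"), ("nanage", "kn"),
    ("ishta", "kn"), ("hogona", "kn"), ("maadi", "kn"),
]


def char_ngrams(word, n=2):
    """Return character n-grams of a word, with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character bigrams (assumed stand-in
    for the unspecified supervised classifiers)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter()       # label -> number of words
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, pairs):
        for word, label in pairs:
            self.label_counts[label] += 1
            for gram in char_ngrams(word):
                self.counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, word):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            # Log prior plus add-one smoothed log likelihoods.
            score = math.log(self.label_counts[label] / total)
            denom = sum(self.counts[label].values()) + len(self.vocab)
            for gram in char_ngrams(word):
                score += math.log((self.counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def tag_sentence(sentence, model, dictionary=ENGLISH_WORDS):
    """Tag each whitespace token: dictionary lookup first, classifier otherwise."""
    tags = []
    for token in sentence.split():
        if token.lower() in dictionary:
            tags.append((token, "en"))
        else:
            tags.append((token, model.predict(token)))
    return tags


model = NgramNaiveBayes().fit(TRAIN)
print(tag_sentence("the movie tumba chennagide", model))
```

The dictionary lookup runs before the classifier, so common English words are resolved cheaply and the statistical model only handles tokens the dictionary does not cover, such as romanized Kannada words. Whether the paper orders the two components this way is an assumption.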