An Automatic Language Identification System for Code-Mixed English-Kannada Social Media Text
Published in: | 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS) pp. 1 - 5 |
---|---|
Main Authors: | , |
Format: | Conference Proceeding |
Language: | English |
Published: | IEEE, 01-12-2017 |
Summary: | Language Identification (LID) is the task of automatically identifying the language of a document or word. With the growing popularity of social media and smart devices, a huge number of people have come online. The majority of user-generated data on the web is in code-mixed or multi-script form, where words are written in a non-native script. This work focuses on the problem of word-level LID for code-mixed data. The collected dataset contains English-Kannada code-mixed sentences from social media posts. Experiments are performed with various supervised classifiers, each combined with an embedded dictionary module to handle word-level code mixing. |
---|---|
DOI: | 10.1109/CSITSS.2017.8447784 |
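
The summary describes word-level LID that combines a dictionary module with a supervised classifier. The record does not say which classifiers or features were used, so the following is only a minimal sketch under stated assumptions: a dictionary lookup is tried first, and a small character-bigram Naive Bayes model (a swapped-in technique, not necessarily the authors') handles words the dictionary does not cover. The names `DICTIONARY`, `TRAIN`, `NgramNaiveBayes`, and `tag_sentence`, and the toy word lists, are hypothetical and exist only for illustration; they are not the paper's data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy resources for illustration only (not the paper's data).
# "en" = English, "kn" = Kannada written in Roman script.
DICTIONARY = {"the": "en", "movie": "en", "nanu": "kn", "illa": "kn"}
TRAIN = [
    ("good", "en"), ("friend", "en"), ("today", "en"),
    ("please", "en"), ("going", "en"), ("thanks", "en"),
    ("beku", "kn"), ("maadi", "kn"), ("oota", "kn"),
    ("hogi", "kn"), ("banni", "kn"), ("gottilla", "kn"),
]


def char_ngrams(word, n=2):
    """Return character n-grams of a word, padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams, add-one smoothed."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter()       # label -> number of training words
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, pairs):
        for word, label in pairs:
            self.label_counts[label] += 1
            for gram in char_ngrams(word):
                self.counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, word):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.label_counts[label] / total)
            denom = sum(self.counts[label].values()) + len(self.vocab)
            for gram in char_ngrams(word):
                score += math.log((self.counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def tag_sentence(sentence, model):
    """Tag each token: dictionary lookup first, classifier as fallback."""
    tags = []
    for token in sentence.split():
        key = token.lower()
        label = DICTIONARY.get(key) or model.predict(key)
        tags.append((token, label))
    return tags


model = NgramNaiveBayes().fit(TRAIN)
print(tag_sentence("nanu movie nodide today", model))
```

In this sketch the dictionary takes priority, so known words are tagged deterministically and the classifier only decides out-of-dictionary tokens. A real system would need a far larger lexicon and training set, and the ordering of the two components is a design choice that the record itself does not specify.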