An improved root extraction technique for Arabic words

Bibliographic Details
Published in:	2010 2nd International Conference on Computer Technology and Development, pp. 264-269
Main Authors:	Al-Nashashibi, M. Y.; Neagu, D.; Yaghi, A. A.
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01-11-2010
Subjects:	Arabic Root Extraction; Natural Language Processing; Pragmatics; Rule-Based Stemming; Text Mining; Weaving

Description
Summary:	Interpreting Arabic text depends, among other things, on a pre-processing stage that effectively extracts the correct stem or root. This work addresses a linguistic approach to root extraction as a pre-processing step for Arabic text mining. The linguistic approach is composed of a rule-based light stemmer and a pattern-based infix remover. Because this approach does not handle weak, eliminated-long-vowel, hamzated, and geminated words, and a reasonably large portion of Arabic words in texts are irregular, we propose an algorithm to handle such cases. The accuracy of the extracted roots is determined by comparing them with a predefined list of 5,405 triliteral and quadriliteral roots. The performance of the linguistic approach (with and without the proposed correction algorithm) was tested on an in-house text collection of eight categories. The proposed correction algorithm improved the accuracy of the linguistic approach by about 14%.
ISBN:	9781424488445, 1424488443
DOI:	10.1109/ICCTD.2010.5645872
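
Illustration:	The pipeline named in the summary (rule-based light stemmer, then pattern-based infix remover, then validation against a root list) might be sketched roughly as follows. This is a minimal sketch and not the authors' implementation: the affix lists, patterns, root list, and all function names are hypothetical stand-ins, since this record does not give the paper's actual rules or its 5,405-root list. The correction algorithm for weak, hamzated, and geminated words is not described here and is deliberately left out.

```python
# Hypothetical, tiny resources for illustration only; the paper's real
# affix rules, patterns, and 5,405-root list are not given in this record.
PREFIXES = ["ال", "و"]
SUFFIXES = ["ون", "ات", "ة"]
KNOWN_ROOTS = {"كتب", "درس", "علم"}

# In each pattern the letters ف, ع, ل mark root-consonant slots;
# every other letter is a literal prefix or infix that must match exactly.
PATTERNS = ["فاعل", "مفعول", "فعال", "مفعل"]
ROOT_SLOTS = "فعل"


def light_stem(word):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def match_pattern(stem, pattern):
    """Return the letters found in the root slots, or None if no match."""
    if len(stem) != len(pattern):
        return None
    root = []
    for letter, slot in zip(stem, pattern):
        if slot in ROOT_SLOTS:
            root.append(letter)      # slot position: keep the stem letter
        elif letter != slot:
            return None              # literal infix or prefix does not match
    return "".join(root)


def extract_root(word):
    """Light-stem the word, remove infixes by pattern, validate the root."""
    stem = light_stem(word)
    if stem in KNOWN_ROOTS:
        return stem
    for pattern in PATTERNS:
        root = match_pattern(stem, pattern)
        if root in KNOWN_ROOTS:
            return root
    return None                      # e.g. irregular words this sketch skips


if __name__ == "__main__":
    for word in ["الكاتب", "مكتوب", "مدرسة", "قال"]:
        print(word, "->", extract_root(word))
```

In this sketch a weak word such as قال yields no root, which is the kind of irregular case the summary says the purely linguistic approach does not handle and the proposed correction algorithm targets.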