An improved root extraction technique for Arabic words

Bibliographic Details
Published in: 2010 2nd International Conference on Computer Technology and Development, pp. 264-269
Main Authors: Al-Nashashibi, M Y; Neagu, D; Yaghi, A A
Format: Conference Proceeding
Language: English
Published: IEEE 01-11-2010
Description
Summary: Interpreting Arabic text depends, among other things, on a pre-processing stage that reliably extracts the correct stem or root. This work presents a linguistic approach to root extraction as a pre-processing step for Arabic text mining. The approach combines a rule-based light stemmer with a pattern-based infix remover. Because this approach does not handle weak, eliminated-long-vowel, hamzated, and geminated words, and a reasonably large portion of Arabic words in texts are irregular, we propose a correction algorithm for these cases. Accuracy is measured by comparing the extracted roots against a predefined list of 5,405 triliteral and quadriliteral roots. The linguistic approach was tested, with and without the proposed correction algorithm, on an in-house text collection of eight categories. The correction algorithm improved the accuracy of the linguistic approach by about 14%.
ISBN: 9781424488445, 1424488443
DOI: 10.1109/ICCTD.2010.5645872
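
The pipeline outlined in the Summary (light stemming, then pattern-based infix removal, then validation against a root list) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the affix lists, the four templates, the three-entry root list, and all function names are hypothetical stand-ins chosen for illustration, and the paper's correction algorithm for irregular words is not reproduced because the record does not describe it.

```python
# Illustrative sketch only. PREFIXES, SUFFIXES, PATTERNS and ROOTS are
# hypothetical toy data, not the rules or the 5,405-root list from the paper.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]   # longest first
SUFFIXES = ["ات", "ون", "ين", "ان", "ها", "ية", "ة", "ه", "ي"]
# Templates written with the placeholder letters ف ع ل for root consonants;
# every other letter in a template is an affix/infix that must match literally.
PATTERNS = ["فاعل", "مفعول", "فعيل", "فعال"]
ROOTS = {"كتب", "درس", "علم"}


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def remove_infixes(stem):
    """Return the root letters if the stem fits a template, else the stem itself."""
    for pattern in PATTERNS:
        if len(pattern) != len(stem):
            continue
        root = []
        for p_ch, s_ch in zip(pattern, stem):
            if p_ch in "فعل":
                root.append(s_ch)      # placeholder position: keep the letter
            elif p_ch != s_ch:
                break                  # literal position does not match
        else:
            return "".join(root)
    return stem


def extract_root(word):
    """Run both stages and accept the candidate only if it is a listed root."""
    candidate = remove_infixes(light_stem(word))
    return candidate if candidate in ROOTS else None


def accuracy(words):
    """Share of words whose extracted root appears in the root list."""
    return sum(extract_root(w) is not None for w in words) / len(words)
```

With this toy data, regular forms such as "الكاتب" and "مكتوب" reduce to the listed root "كتب", while a weak (hollow) verb such as "قال" yields no listed root. That is the kind of irregular case the Summary says the linguistic approach alone does not handle.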