Stemming techniques for Arabic words: A comparative study

Bibliographic Details
Published in: 2010 2nd International Conference on Computer Technology and Development, pp. 270-276
Main Authors: Al-Nashashibi, M Y; Neagu, D; Yaghi, A A
Format: Conference Proceeding
Language: English
Published: IEEE, 01-11-2010
Description
Summary: Text interpretation depends, among other things, on a pre-processing stage that effectively extracts a correct stem or root. Since no standard stemmer is available for Arabic, we examine five methods for extracting Arabic roots; the outcomes of the best-performing approach will be used later on. Four of these methods are based on positional letter ranking: the original approach, an adjusted version, and two proposed variants. The fifth is a rule-based approach. An algorithm for correcting irregular words is applied to all methods, and all approaches are compared. Accuracy was measured by comparing the extracted roots with a predefined list of roots, using an in-house text collection. Results show that the correction algorithm improved the accuracy of the rule-based method by about 14% and that of the positional-letter-ranking algorithms by 7% to 10%. Without correction, the adjusted positional-letter-ranking method had the highest accuracy among the five algorithms, though only slightly higher than the rule-based one. With the correction algorithm included, however, the rule-based algorithm had the highest accuracy among all ten algorithms (the five methods, each with and without correction).
ISBN: 9781424488445, 1424488443
DOI: 10.1109/ICCTD.2010.5645873
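
The summary describes measuring accuracy by comparing extracted roots with a predefined list of roots. The following is a minimal sketch of that kind of comparison, not the authors' implementation: the function name root_accuracy, the three sample words, and the "before" and "after" stemmer outputs are all hypothetical and chosen only for illustration.

```python
from typing import Dict


def root_accuracy(extracted: Dict[str, str], reference: Dict[str, str]) -> float:
    """Return the fraction of reference words whose extracted root matches the reference root."""
    if not reference:
        raise ValueError("reference list is empty")
    correct = sum(
        1 for word, root in reference.items() if extracted.get(word) == root
    )
    return correct / len(reference)


# Hypothetical reference list (word -> expected root); not taken from the paper.
reference = {"كاتب": "كتب", "مدرسة": "درس", "قال": "قول"}

# Hypothetical stemmer output before a correction step for irregular words:
# the hollow verb "قال" is left unchanged instead of being mapped to "قول".
before = {"كاتب": "كتب", "مدرسة": "درس", "قال": "قال"}

# Hypothetical output after the correction step.
after = {"كاتب": "كتب", "مدرسة": "درس", "قال": "قول"}

print(f"{root_accuracy(before, reference):.2f}")  # 0.67
print(f"{root_accuracy(after, reference):.2f}")   # 1.00
```

In this toy example the gain comes entirely from the one irregular word, which mirrors the role the summary assigns to the correction algorithm; the figures here say nothing about the accuracies reported in the paper.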