A novel approach to the extraction of roots from Arabic words using bigrams

  • Authors:
  • Ismail I. Hmeidi; Riyad F. Al-Shalabi; Ahmad T. Al-Taani; Hassan Najadat; Shaker A. Al-Hazaimeh

  • Affiliations:
  • Department of Computer Information Systems, Jordan University of Science and Technology, P.O. Box 3030, Irbid 22110, Jordan;The Arab Academy for Banking and Financial Sciences, Amman, Jordan;Department of Computer Sciences, Yarmouk University, P.O. Box 566, Irbid 2211, Jordan;Department of Computer Information Systems, Jordan University of Science and Technology, P.O. Box 3030, Irbid 22110, Jordan;Department of Computer Information Systems, Jordan University of Science and Technology, P.O. Box 3030, Irbid 22110, Jordan

  • Venue:
  • Journal of the American Society for Information Science and Technology
  • Year:
  • 2010

Abstract

Root extraction is an important task in information retrieval (IR), natural language processing (NLP), text summarization, and related fields. Over the last two decades, several algorithms have been proposed to extract Arabic roots. Most of these algorithms handle triliteral roots only, and some handle fixed-length words only. This study proposes a novel approach to extracting roots from Arabic words using bigrams. Two measures are used: a dissimilarity measure, the “Manhattan distance,” and a similarity measure, Dice's measure. The proposed algorithm is tested on the Holy Qur'an and on a corpus of 242 abstracts from the Proceedings of the Saudi Arabian National Computer Conferences. The two files cover a wide range of data: the Holy Qur'an contains most of the ancient Arabic words, while the abstracts corpus contains some modern Arabic words and some words borrowed from foreign languages in addition to original Arabic words. The results show that combining N-grams with the Dice measure gives better results than using the Manhattan distance measure. © 2010 Wiley Periodicals, Inc.
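
The abstract names the two measures but does not spell out the procedure, so the following is only a minimal sketch of how bigram matching against a list of candidate roots might look. It is not the authors' algorithm. The use of adjacent character bigrams, the toy root list, and all function names (`bigrams`, `dice_similarity`, `manhattan_distance`, `best_root_dice`, `best_root_manhattan`) are assumptions made for illustration.

```python
from collections import Counter


def bigrams(word):
    """Return the adjacent character pairs of `word` (assumed bigram definition)."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice_similarity(a, b):
    """Dice coefficient over bigram sets: 2|A & B| / (|A| + |B|). Higher is closer."""
    set_a, set_b = set(bigrams(a)), set(bigrams(b))
    if not set_a and not set_b:
        return 0.0  # no bigrams on either side, treat as no evidence of similarity
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def manhattan_distance(a, b):
    """Sum of absolute differences of bigram counts. Lower is closer."""
    count_a, count_b = Counter(bigrams(a)), Counter(bigrams(b))
    return sum(abs(count_a[g] - count_b[g]) for g in set(count_a) | set(count_b))


def best_root_dice(word, roots):
    """Pick the candidate root with the highest Dice similarity to `word`."""
    return max(roots, key=lambda root: dice_similarity(word, root))


def best_root_manhattan(word, roots):
    """Pick the candidate root with the lowest Manhattan distance to `word`."""
    return min(roots, key=lambda root: manhattan_distance(word, root))


# Hypothetical toy root list, not taken from the paper's data.
candidate_roots = ["كتب", "درس", "علم"]
word = "كاتب"

print(best_root_dice(word, candidate_roots))
print(best_root_manhattan(word, candidate_roots))
```

On this toy list both measures select كتب, since it is the only candidate that shares a bigram (تب) with the word. A real system would need a full root lexicon and, most likely, affix handling that this sketch leaves out.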