Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis

  • Authors:
  • Leah S. Larkey;Lisa Ballesteros;Margaret E. Connell

  • Affiliations:
  • Univ. of Massachusetts, Amherst, MA;Mt. Holyoke College, South Hadley, MA;Univ. of Massachusetts, Amherst, MA

  • Venue:
  • SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2002

Abstract

Arabic, a highly inflected language, requires good stemming for effective information retrieval, yet no standard approach to stemming has emerged. We developed several light stemmers based on heuristics and a statistical stemmer based on co-occurrence for Arabic retrieval. We compared the retrieval effectiveness of our stemmers and of a morphological analyzer on the TREC-2001 data. The best light stemmer was more effective for cross-language retrieval than a morphological stemmer which tried to find the root for each word. A repartitioning process consisting of vowel removal followed by clustering using co-occurrence analysis produced stem classes which were better than no stemming or very light stemming, but still inferior to good light stemming or morphological analysis.
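
To make the idea of heuristic light stemming concrete, the following is a minimal sketch of affix stripping in Python. It is not the authors' stemmer: the abstract does not list the affixes or rules used, so the prefix and suffix lists, the function name `light_stem`, and the `min_len` threshold are all assumptions chosen for illustration. The sketch removes at most one common prefix and then repeatedly removes common suffixes, without attempting to find the root.

```python
# Illustrative affix lists (assumptions for this sketch, not from the abstract).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي")


def light_stem(word: str, min_len: int = 2) -> str:
    """Strip one common prefix and any common suffixes from an Arabic word.

    An affix is removed only if at least `min_len` characters remain,
    so very short words are left untouched.
    """
    # Try longer prefixes first so that "وال" wins over "و".
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break

    # Remove suffixes repeatedly until none applies.
    changed = True
    while changed:
        changed = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                word = word[:-len(suffix)]
                changed = True
                break
    return word


if __name__ == "__main__":
    for w in ("الكتاب", "والمعلمون", "كتابها"):
        print(w, "->", light_stem(w))
```

In a retrieval setting, a function like this would be applied to both document and query terms at indexing and search time so that inflected variants map to the same index term.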