Stemming to improve translation lexicon creation form bitexts

  • Authors:
  • Mohamed Abdel Fattah; Fuji Ren; Shingo Kuroiwa

  • Affiliations:
  • Faculty of Engineering, University of Tokushima, Tokushima, Japan; Faculty of Engineering, University of Tokushima, Tokushima, Japan; Faculty of Engineering, University of Tokushima, Tokushima, Japan

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2006

Abstract

Arabic is a morphologically rich language that presents significant challenges to many natural language processing applications, because a word often conveys complex meanings that decompose into several morphemes (i.e., prefix, stem, suffix). Segmenting words into morphemes can improve the extraction of English/Arabic translation pairs from parallel texts. This paper describes two algorithms, and their combination, for automatically extracting an English/Arabic bilingual dictionary from parallel texts available in the Internet archive, using an Arabic light stemmer as a preprocessing step. Before the Arabic light stemmer was used, the overall system precision and recall were 88.6% and 81.5%, respectively; after it was applied to the Arabic documents, they increased to 91.6% and 82.6%, respectively. The algorithms have variables whose values can be changed to control the system precision and recall. As with most such systems, the accuracy of our system is directly proportional to the number of sentence pairs used. However, our system is able to extract translation pairs from a very small parallel corpus: it can extract translations from only two sentences in one language and two sentences in the other, provided the requirements of the system are met. Moreover, the system is able to extract word pairs that are translations of each other, synonyms, and explanations of a word in the other language. By controlling the system variables, we could achieve 100% precision for the output bilingual dictionary with a small recall.
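
The trade-off between precision and recall mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the confidence scores, the single threshold variable, the helper names `precision_recall` and `filter_by_score`, and the transliterated word pairs are all hypothetical, chosen only to show how tightening a control variable can raise precision to 100% while recall falls.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted pairs against a gold dictionary."""
    extracted, gold = set(extracted), set(gold)
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


def filter_by_score(candidates, threshold):
    """Keep candidate pairs whose (hypothetical) confidence score meets the threshold."""
    return {pair for pair, score in candidates.items() if score >= threshold}


# Hypothetical gold-standard dictionary (Arabic shown in transliteration).
gold = {
    ("book", "kitab"),
    ("pen", "qalam"),
    ("house", "bayt"),
    ("school", "madrasa"),
}

# Hypothetical candidate pairs with made-up confidence scores.
candidates = {
    ("book", "kitab"): 0.95,
    ("pen", "qalam"): 0.80,
    ("house", "bayt"): 0.60,
    ("house", "qalam"): 0.55,   # wrong pair
    ("school", "kitab"): 0.30,  # wrong pair
}

# A loose threshold admits more pairs, including one wrong pair.
loose = precision_recall(filter_by_score(candidates, 0.5), gold)
# A strict threshold keeps only the most confident pair.
strict = precision_recall(filter_by_score(candidates, 0.9), gold)

print("loose :", loose)   # (0.75, 0.75)
print("strict:", strict)  # (1.0, 0.25)
```

With the strict setting, every extracted pair is correct (100% precision), but only one of the four gold pairs is recovered, which mirrors the small-recall behaviour described above.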