Stemming to improve translation lexicon creation form bitexts

Authors:
Mohamed Abdel Fattah;Fuji Ren;Shingo Kuroiwa
Affiliations:
Faculty of Engineering, University of Tokushima, Tokushima, Japan;Faculty of Engineering, University of Tokushima, Tokushima, Japan;Faculty of Engineering, University of Tokushima, Tokushima, Japan
Venue:
Information Processing and Management: an International Journal
Year:
2006

Abstract

Arabic is a morphologically rich language that presents significant challenges to many natural language processing applications because a word often conveys complex meanings decomposable into several morphemes (i.e., prefix, stem, suffix). Segmenting words into morphemes can improve the extraction of English/Arabic translation pairs from parallel texts. This paper describes two algorithms, and their combination, for automatically extracting an English/Arabic bilingual dictionary from parallel texts in the Internet Archive, using an Arabic light stemmer as a preprocessing step. Before the light stemmer was applied, the overall system precision and recall were 88.6% and 81.5%, respectively; after it was applied to the Arabic documents, they rose to 91.6% and 82.6%, respectively. The algorithms have variables whose values can be changed to control the system's precision and recall. As with most such systems, the accuracy of our system is directly proportional to the number of sentence pairs used. However, our system can extract translation pairs from a very small parallel corpus: as few as two sentences in one language and two sentences in the other, provided the system's requirements are met. Moreover, the system can extract word pairs that are translations of each other, synonyms, and explanations of a word in the other language. By controlling the system variables, we could achieve 100% precision for the bilingual dictionary, at the cost of low recall.
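
To illustrate the preprocessing idea, the following is a minimal sketch of light stemming: stripping one common prefix and one common suffix so that different surface forms of a word collapse to a shared stem before translation pairs are counted. The abstract does not give the stemmer's affix inventory or rules, so the `PREFIXES` and `SUFFIXES` lists, the `light_stem` function, and the `min_stem_len` parameter are all assumptions made for illustration, loosely modeled on affix lists used in published Arabic light stemmers.

```python
# Hypothetical affix lists (assumption): NOT the inventory used in the paper,
# which the abstract does not specify.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]


def light_stem(word, min_stem_len=2):
    """Strip at most one prefix and one suffix, longest match first.

    min_stem_len (assumption) prevents stripping an affix when doing so
    would leave fewer than that many characters.
    """
    # Try longer prefixes first so that "وال" wins over "و".
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem_len:
            word = word[len(prefix):]
            break

    # Same longest-first policy for suffixes.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            word = word[:-len(suffix)]
            break

    return word


if __name__ == "__main__":
    # Surface forms of "book" with and without the definite article
    # and conjunction, plus a definite plural form of "teacher".
    for token in ["كتاب", "الكتاب", "والكتاب", "المعلمون"]:
        print(token, "->", light_stem(token))
```

Under these assumed lists, the bare, definite, and conjoined forms of the same noun all map to one stem, so their co-occurrence counts with an English candidate are pooled rather than split across forms. This is one plausible reading of why stemming the Arabic side could raise both precision and recall, not a description of the paper's actual algorithms.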