Statistical transliteration for English-Arabic cross language information retrieval

Authors:
Nasreen AbdulJaleel;Leah S. Larkey
Affiliations:
University of Massachusetts, Amherst, MA;University of Massachusetts, Amherst, MA
Venue:
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Year:
2003


Abstract

Out-of-vocabulary (OOV) words are problematic for cross-language information retrieval. One way to deal with OOV words when the two languages have different alphabets is to transliterate the unknown words, that is, to render them in the orthography of the second language. In this study, we present a simple statistical technique for training an English-to-Arabic transliteration model from pairs of names. We call this a selected n-gram model because training proceeds in two stages: the first stage learns which n-gram segments should be added to the unigram inventory for the source language, and the second stage learns the translation model over this inventory. This technique requires no heuristics or linguistic knowledge of either language. We evaluate the statistically trained model and a simpler hand-crafted model on a test set of named entities from the Arabic AFP corpus and demonstrate that they perform better than two online translation sources. We also explore the effectiveness of these systems on the TREC 2002 cross-language IR task. We find that transliterating either OOV named entities or all OOV words is an effective approach for cross-language IR.
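
The two-stage idea in the abstract can be illustrated with a small Python sketch. This is one possible reading under stated assumptions, not the authors' implementation: the toy character alignments in `ALIGNED` are invented for illustration, the function names and the `min_count` threshold are hypothetical, and the second stage simply reuses the stage-one links with relative-frequency counts rather than running a real alignment model again.

```python
from collections import Counter, defaultdict

# Arabic letters written as escapes so the listing is unaffected by bidi rendering.
SHEEN, SEEN, FEH, ALEF, REH, YEH, MEEM, LAM = (
    "\u0634", "\u0633", "\u0641", "\u0627", "\u0631", "\u064A", "\u0645", "\u0644")

# Hypothetical toy alignments (not the paper's data). Each name is a list of
# (english_chunk, arabic_char) links; a multi-letter chunk means an aligner
# linked several English letters to a single Arabic character.
ALIGNED = [
    [("sh", SHEEN), ("a", ALEF), ("r", REH), ("i", YEH), ("f", FEH)],
    [("sh", SHEEN), ("a", ALEF), ("m", MEEM), ("i", YEH)],
    [("ph", FEH), ("i", YEH), ("l", LAM)],
    [("ph", FEH), ("a", ALEF), ("r", REH)],
    [("s", SEEN), ("a", ALEF), ("m", MEEM), ("i", YEH)],
    [("ll", LAM), ("a", ALEF)],
]

def select_ngrams(aligned, min_count=2):
    """Stage 1: keep multi-letter chunks seen at least min_count times."""
    counts = Counter(src for name in aligned for src, _ in name if len(src) > 1)
    return {src for src, c in counts.items() if c >= min_count}

def segment(word, inventory):
    """Greedy longest-match split into selected n-grams, else single letters."""
    longest = max((len(g) for g in inventory), default=1)
    out, i = [], 0
    while i < len(word):
        for n in range(min(longest, len(word) - i), 0, -1):
            if n == 1 or word[i:i + n] in inventory:
                out.append(word[i:i + n])
                i += n
                break
    return out

def train_table(aligned, inventory):
    """Stage 2 (simplified): relative-frequency P(target | segment).

    Links whose multi-letter chunk was not selected in stage 1 are skipped.
    """
    counts = defaultdict(Counter)
    for name in aligned:
        for src, tgt in name:
            if len(src) == 1 or src in inventory:
                counts[src][tgt] += 1
    return {s: {t: c / sum(tc.values()) for t, c in tc.items()}
            for s, tc in counts.items()}

def transliterate(word, inventory, table):
    """Emit the most probable target per segment; unseen segments are skipped."""
    return "".join(max(table[s], key=table[s].get)
                   for s in segment(word, inventory) if s in table)

inventory = select_ngrams(ALIGNED)          # "ll" occurs once, so it is dropped
table = train_table(ALIGNED, inventory)
print(segment("shaphi", inventory))         # ['sh', 'a', 'ph', 'i']
result = transliterate("shami", inventory, table)
```

In this toy setup, adding "sh" to the inventory lets "shami" and "sami" map to different first letters (sheen versus seen), which a purely single-letter model could not distinguish. A real system would replace the hand-written links with the output of a statistical aligner and would score whole candidate transliterations rather than picking each segment's best target independently.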