MINT: a method for effective and scalable mining of named entity transliterations from large comparable corpora

Authors:
Raghavendra Udupa;K. Saravanan;A. Kumaran;Jagadeesh Jagarlamudi
Affiliations:
Microsoft Research India, Bangalore, India;Microsoft Research India, Bangalore, India;Microsoft Research India, Bangalore, India;Microsoft Research India, Bangalore, India
Venue:
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Year:
2009


Abstract

In this paper, we address the problem of mining transliterations of Named Entities (NEs) from large comparable corpora. We leverage the empirical fact that multilingual news articles with similar news content are rich in Named Entity Transliteration Equivalents (NETEs). Our mining algorithm, MINT, uses a cross-language document similarity model to align multilingual news articles and then mines NETEs from the aligned articles using a transliteration similarity model. We show that our approach is highly effective on six comparable corpora pairing English with four languages from three language families. Furthermore, it performs substantially better than a state-of-the-art competitor.
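
The two-stage idea in the abstract (first align articles, then mine name pairs from the aligned articles) can be illustrated with a minimal sketch. Everything below is an assumption for illustration and not the models used in the paper: `doc_similarity` is a toy Jaccard overlap through a hypothetical bilingual lexicon, `translit_similarity` is a plain string-similarity ratio over romanized tokens, and the thresholds are arbitrary.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Set, Tuple

# A source document is (tokens, named entities); a target document is a
# list of romanized tokens. Both shapes are assumptions made for this sketch.
SourceDoc = Tuple[List[str], List[str]]
TargetDoc = List[str]


def doc_similarity(src_tokens: List[str], tgt_tokens: List[str],
                   lexicon: Dict[str, str]) -> float:
    """Toy stand-in for a cross-language document similarity model:
    Jaccard overlap after mapping source tokens through a small lexicon."""
    mapped = {lexicon.get(tok, tok) for tok in src_tokens}
    target = set(tgt_tokens)
    union = mapped | target
    return len(mapped & target) / len(union) if union else 0.0


def translit_similarity(src_name: str, tgt_word: str) -> float:
    """Toy stand-in for a transliteration similarity model:
    character-level similarity ratio between the two strings."""
    return SequenceMatcher(None, src_name.lower(), tgt_word.lower()).ratio()


def mine_pairs(src_docs: List[SourceDoc], tgt_docs: List[TargetDoc],
               lexicon: Dict[str, str],
               doc_threshold: float = 0.3,
               name_threshold: float = 0.7) -> Set[Tuple[str, str]]:
    """Stage 1: keep only document pairs that look similar.
    Stage 2: within those pairs, keep (name, word) pairs that look
    like transliterations of each other."""
    pairs: Set[Tuple[str, str]] = set()
    for src_tokens, src_names in src_docs:
        for tgt_tokens in tgt_docs:
            if doc_similarity(src_tokens, tgt_tokens, lexicon) < doc_threshold:
                continue  # articles not aligned, so skip mining
            for name in src_names:
                for word in tgt_tokens:
                    if translit_similarity(name, word) >= name_threshold:
                        pairs.add((name, word))
    return pairs


if __name__ == "__main__":
    lexicon = {"minister": "mantri", "met": "mile"}  # hypothetical entries
    src_docs = [(["minister", "met", "obama"], ["obama"])]
    tgt_docs = [
        ["mantri", "obaama", "mile"],    # similar content, gets aligned
        ["cricket", "obamaa", "score"],  # unrelated content, gets skipped
    ]
    print(mine_pairs(src_docs, tgt_docs, lexicon))
```

In this sketch the alignment stage acts as a filter: the second target article contains a string close to the name, but it is never compared because the articles themselves are not similar. This mirrors the abstract's observation that articles with similar content are the ones rich in NETEs.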