Transliteration mining using large training and test sets

  • Authors:
  • Ali El Kahki; Kareem Darwish; Ahmed Saad El Din; Mohamed Abd El-Wahab

  • Affiliations:
  • Qatar Foundation, Doha, Qatar; Qatar Foundation, Doha, Qatar; Qatar Foundation, Doha, Qatar; Cairo University, Cairo, Egypt

  • Venue:
  • NAACL HLT '12: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
  • Year:
  • 2012

Abstract

Much previous work on Transliteration Mining (TM) was conducted on short parallel snippets using limited training data, and successful methods tended to favor recall. For such methods, increasing the training data can degrade precision, and applying them to large comparable texts can degrade both precision and recall. We adapt a state-of-the-art TM technique, namely graph reinforcement, which has the best reported scores on the ACL 2010 NEWS workshop dataset, to work with large training sets. The method models observed character mappings between a language pair as a bipartite graph and induces unseen mappings using random walks. Increasing the training data yields more correct initial mappings, but the induced mappings become more error-prone. We introduce a parameterized exponential penalty into the graph reinforcement formulation, and we estimate the proper parameters for training sets of varying sizes. The new formulation led to sizable improvements in precision. Mining from large comparable texts also introduces phonetically similar words in the source and target texts that are not transliterations or that adversely affect candidate ranking. To overcome this, we extracted related segments that have high translation overlap, and then we performed TM on them. Segment extraction produced significantly higher precision for three different TM methods.
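
The abstract names the techniques but not their exact formulations, so the following Python sketches are illustrative rather than the authors' implementation. The first sketches graph reinforcement over the bipartite character-mapping graph, assuming that mappings supported by two-hop paths s -> t' -> s' -> t are combined as p = 1 - prod(1 - p_path), and that a hypothetical exponent penalty greater than 1 is applied only to induced (previously unseen) mappings; the paper's actual parameterized exponential penalty may take a different form.

```python
from collections import defaultdict

def normalize(weights):
    """Scale a {key: weight} dict into a probability distribution."""
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()} if total else {}

def graph_reinforcement(p_st, penalty=2.0, n_iters=1):
    """Induce unseen source->target character mappings by walking the
    bipartite mapping graph.

    p_st    : dict mapping source char s -> {target char t: p(t|s)},
              estimated from character alignments of known pairs.
    penalty : exponent applied to induced mappings only; values > 1
              dampen them (a hypothetical stand-in for the paper's
              parameterized exponential penalty).
    """
    for _ in range(n_iters):
        # Reverse direction p(s|t), rebuilt from the current graph.
        rev = defaultdict(dict)
        for s, targets in p_st.items():
            for t, p in targets.items():
                rev[t][s] = p
        p_ts = {t: normalize(ws) for t, ws in rev.items()}

        new_p_st = {}
        for s, targets in p_st.items():
            seen = set(targets)
            # Every two-hop path s -> t' -> s' -> t supports the mapping
            # s -> t; combine path probabilities as 1 - prod(1 - p_path).
            miss = defaultdict(lambda: 1.0)
            for t1, p1 in targets.items():
                for s1, p2 in p_ts[t1].items():
                    for t2, p3 in p_st[s1].items():
                        miss[t2] *= 1.0 - p1 * p2 * p3
            scored = {}
            for t, q in miss.items():
                p = 1.0 - q
                # Penalize only mappings absent from the observed graph.
                scored[t] = p if t in seen else p ** penalty
            new_p_st[s] = normalize(scored)
        p_st = new_p_st
    return p_st

# Toy alphabet example: the shared target "A" lets the walk induce a
# penalized mapping "s" -> "B" that was never observed directly.
p0 = {"s": {"A": 1.0}, "c": {"A": 0.5, "B": 0.5}}
print(graph_reinforcement(p0))
```

The second sketches the segment-extraction step, assuming that translation overlap is measured as the fraction of source tokens with a bilingual-dictionary translation present in the target segment and that a fixed threshold selects related pairs; both the measure and the threshold are assumptions, not the paper's definitions.

```python
def translation_overlap(src_tokens, tgt_tokens, bi_dict):
    """Fraction of source tokens with at least one dictionary translation
    appearing in the target segment (an assumed overlap measure; the
    abstract only states that related segments have high overlap)."""
    tgt = set(tgt_tokens)
    hits = sum(1 for w in src_tokens if bi_dict.get(w, set()) & tgt)
    return hits / len(src_tokens) if src_tokens else 0.0

def related_segments(src_segs, tgt_segs, bi_dict, threshold=0.5):
    """Keep only segment pairs whose overlap clears the threshold;
    transliteration mining then runs on these pairs alone, filtering
    out coincidentally similar-sounding words elsewhere in the texts."""
    return [(s, t) for s in src_segs for t in tgt_segs
            if translation_overlap(s, t, bi_dict) >= threshold]
```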