Mining English-Chinese Named Entity Pairs from Comparable Corpora

Authors:
Lishuang Li;Peng Wang;Degen Huang;Lian Zhao
Affiliations:
Dalian University of Technology;Dalian University of Technology;Dalian University of Technology;Dalian University of Technology
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2011

Abstract

Bilingual Named Entity (NE) pairs are valuable resources for many NLP applications. Because comparable corpora are more accessible, abundant, and up to date, recent research has concentrated on mining bilingual lexicons from them. This work presents a novel approach to mining English-Chinese NE translations from comparable corpora by combining, for every candidate NE pair, multi-dimensional features drawn from several information sources: the transliteration model, English-Chinese matching, Chinese-English matching, the translation model, length, and the context vector. These features are integrated into a single model by linear combination, with the minimum sample risk (MSR) algorithm. Because NE translation is highly type-dependent, we integrate a different set of features for each NE type. We experiment with the features individually and in combination to mine person NE (PN) pairs, location NE (LN) pairs, and organization NE (ON) pairs. For PN pairs, the best performance, an F-score of 84.9%, is achieved with the transliteration and length features. For LN pairs, the best performance, an F-score of 83.4%, is achieved with the transliteration model, length, translation model, English-Chinese matching, and Chinese-English matching features. For ON pairs, the best performance, an F-score of 84.1%, is achieved with the English-Chinese matching and Chinese-English matching features.
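
The type-dependent linear combination described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, feature keys, feature values, and weights below are all hypothetical, and the weights are fixed by hand rather than tuned with MSR, which is not shown. Only the per-type feature subsets are taken from the abstract.

```python
from typing import Dict, Tuple

# Feature subsets reported in the abstract as giving the best F-score per NE type.
# The key names themselves are hypothetical labels chosen for this sketch.
FEATURES_BY_TYPE: Dict[str, Tuple[str, ...]] = {
    "PN": ("transliteration", "length"),
    "LN": ("transliteration", "length", "translation",
           "ec_matching", "ce_matching"),
    "ON": ("ec_matching", "ce_matching"),
}


def combined_score(ne_type: str,
                   features: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Weighted sum over only the features selected for this NE type."""
    return sum(weights[name] * features[name]
               for name in FEATURES_BY_TYPE[ne_type])


def best_candidate(ne_type: str,
                   candidates: Dict[str, Dict[str, float]],
                   weights: Dict[str, float]) -> str:
    """Return the candidate whose combined score is highest."""
    return max(candidates,
               key=lambda c: combined_score(ne_type, candidates[c], weights))


# Hypothetical weights and feature values, for illustration only.
pn_weights = {"transliteration": 0.7, "length": 0.3}
pn_candidates = {
    "candidate_a": {"transliteration": 0.8, "length": 0.5, "context": 0.9},
    "candidate_b": {"transliteration": 0.4, "length": 0.9, "context": 0.1},
}
chosen = best_candidate("PN", pn_candidates, pn_weights)
```

In this sketch the `context` value is ignored for person names because it is not in the PN subset, which mirrors the idea of selecting features by NE type. In practice the weights would be learned from data rather than set by hand.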