Clustering and classifying person names by origin

Authors:
Fei Huang;Stephan Vogel;Alex Waibel
Affiliations:
Language Technologies Institute, School of Computer Sciences, Carnegie Mellon University, Pittsburgh, PA;Language Technologies Institute, School of Computer Sciences, Carnegie Mellon University, Pittsburgh, PA;Language Technologies Institute, School of Computer Sciences, Carnegie Mellon University, Pittsburgh, PA
Venue:
AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 3
Year:
2005

Abstract

In natural language processing, information about a person's geographical origin is an important feature for named entity transliteration and question answering. We propose a language-independent framework for clustering and classifying names by origin. Given a small number of bilingual name translation pairs with labeled origins, we measure the similarity between origins using the perplexities of character-level name language models and translation models. We group similar origins into clusters, then train a Bayesian classifier with different features. The classifier achieves 84% classification accuracy with source names only, and 91% with both source and target name pairs. We apply origin clustering and classification to a name transliteration task. The cluster-specific transliteration model improves transliteration accuracy from 3.8% to 55% and reduces the transliteration character error rate from 50.3 to 13.5. Adding more unlabeled name pairs to the cluster-specific name transliteration model further improves transliteration accuracy.
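
To make the perplexity-based idea concrete, here is a minimal Python sketch using source names only. It is an illustration under stated assumptions, not the authors' implementation: the add-one-smoothed character bigram model, the symmetric cross-perplexity distance in `origin_distance`, the toy name lists, and all function names are hypothetical choices. The abstract does not specify the model order, the exact similarity measure, or the classifier features, and the translation-model component is omitted here.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # name boundary symbols (assumed convention)

def bigrams(name):
    """Return the character bigrams of a name, including boundary symbols."""
    chars = [BOS] + list(name.lower()) + [EOS]
    return list(zip(chars, chars[1:]))

class CharBigramLM:
    """Character bigram language model with add-one smoothing (an assumption)."""

    def __init__(self, names, vocab_size):
        self.vocab_size = vocab_size
        self.context = Counter()  # how often each character occurs as a context
        self.pair = Counter()     # how often each (context, next) pair occurs
        for name in names:
            for prev, cur in bigrams(name):
                self.context[prev] += 1
                self.pair[(prev, cur)] += 1

    def log_prob(self, name):
        """Natural-log probability of a name under the smoothed model."""
        return sum(
            math.log((self.pair[(p, c)] + 1) / (self.context[p] + self.vocab_size))
            for p, c in bigrams(name)
        )

    def perplexity(self, names):
        """Per-character perplexity of the model on a list of names."""
        tokens = sum(len(bigrams(n)) for n in names)
        total = sum(self.log_prob(n) for n in names)
        return math.exp(-total / tokens)

def origin_distance(lm_a, names_a, lm_b, names_b):
    """Hypothetical symmetric distance: mean log cross-perplexity of two origins."""
    return 0.5 * (math.log(lm_a.perplexity(names_b)) + math.log(lm_b.perplexity(names_a)))

def classify(name, models, priors):
    """Bayes decision rule: argmax over origins of log P(origin) + log P(name | origin)."""
    return max(models, key=lambda o: math.log(priors[o]) + models[o].log_prob(name))

# Toy labeled data, invented for illustration only.
data = {
    "japanese": ["tanaka", "yamada", "suzuki", "nakamura", "takahashi"],
    "german": ["schmidt", "schneider", "fischer", "schwarz", "hoffmann"],
}

# Shared vocabulary: every observed character plus the end-of-name symbol.
vocab_size = len({c for names in data.values() for n in names for c in n.lower()}) + 1
models = {origin: CharBigramLM(names, vocab_size) for origin, names in data.items()}
total_names = sum(len(names) for names in data.values())
priors = {origin: len(names) / total_names for origin, names in data.items()}

if __name__ == "__main__":
    ja, de = models["japanese"], models["german"]
    print("self perplexity (japanese): ", ja.perplexity(data["japanese"]))
    print("cross perplexity (ja on de):", ja.perplexity(data["german"]))
    print("distance:", origin_distance(ja, data["japanese"], de, data["german"]))
    print("nakata ->", classify("nakata", models, priors))
```

In a setting with many origins, pairwise values of such a distance could feed a standard agglomerative clustering step, with one model trained per resulting cluster. That extension is likewise an assumption about how the pieces might fit together.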