Improving name origin recognition with context features and unlabelled data

Authors:
Vladimir Pervouchine;Min Zhang;Ming Liu;Haizhou Li
Affiliations:
Institute for Infocomm Research, A-STAR;Institute for Infocomm Research, A-STAR;Institute for Infocomm Research, A-STAR;Institute for Infocomm Research, A-STAR
Venue:
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Year:
2010

Abstract

We demonstrate the use of context features, namely names of places, and unlabelled data for detecting the language of origin of personal names. While some early work used either rule-based methods or n-gram statistical models to determine the language of origin of a name, we treat the problem as a classification task and use a discriminative maximum entropy model. We bootstrap the learning with a list of names that appear out of context but have known origin, and then use the expectation-maximisation algorithm to further train the model on a large corpus of names of unknown origin but with context features. Using a relatively small unlabelled corpus, we improve the accuracy of name origin recognition for names written in Chinese from 82.7% to 85.8%, a significant reduction in the error rate (from 17.3% to 14.2%). The improvement in F-score for infrequent Japanese names is even greater: from 77.4% without context features to 82.8% with context features.
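
The training procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature templates (character unigrams and bigrams plus one `place=` context feature), the plain gradient-ascent optimiser, the number of rounds, and the toy romanised names standing in for names written in Chinese are all assumptions made for illustration. The sketch first fits a maximum entropy (softmax) classifier on seed names with known origin and no context, then alternates an E-step (soft-labelling the unlabelled names, which do carry a place feature) with an M-step (refitting on the seed and soft-labelled data together).

```python
import math
from collections import defaultdict

LABELS = ["CN", "JP"]
# Hypothetical toy data: romanised names stand in for names written in Chinese.
SEED = [("wang", "CN"), ("zhang", "CN"), ("liu", "CN"),
        ("tanaka", "JP"), ("suzuki", "JP"), ("yamada", "JP")]
UNLABELLED = [("zhao", "beijing"), ("huang", "beijing"),
              ("nakamura", "tokyo"), ("takada", "tokyo"), ("yamamoto", "osaka")]


def features(name, place=None):
    """Character unigrams and bigrams, plus an optional place context feature."""
    padded = "^" + name + "$"
    feats = ["c1=" + c for c in name]
    feats += ["c2=" + padded[i:i + 2] for i in range(len(padded) - 1)]
    if place is not None:
        feats.append("place=" + place)
    return feats


def posterior(weights, feats):
    """Maximum entropy posterior over LABELS: softmax of summed feature weights."""
    scores = [sum(weights[(y, f)] for f in feats) for y in LABELS]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(examples, epochs=50, lr=0.1):
    """Fit weights by stochastic gradient ascent on the log-likelihood.

    Each example is (feature list, target distribution over LABELS), so the
    same routine handles hard seed labels and soft E-step labels.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, target in examples:
            probs = posterior(weights, feats)
            for k, y in enumerate(LABELS):
                for f in feats:
                    weights[(y, f)] += lr * (target[k] - probs[k])
    return weights


def em_self_training(seed, unlabelled, rounds=3):
    """Bootstrap on context-free seed names, then run EM-style rounds."""
    hard = [(features(n), [1.0 if y == lab else 0.0 for y in LABELS])
            for n, lab in seed]
    weights = train(hard)
    for _ in range(rounds):
        # E-step: soft-label each unlabelled name with the current posterior.
        soft = [(features(n, p), posterior(weights, features(n, p)))
                for n, p in unlabelled]
        # M-step: refit from scratch on seed plus soft-labelled examples.
        weights = train(hard + soft)
    return weights


def predict(weights, name, place=None):
    """Return the most probable origin label for a name."""
    probs = posterior(weights, features(name, place))
    return LABELS[probs.index(max(probs))]
```

Calling `predict(em_self_training(SEED, UNLABELLED), "takada", "tokyo")` returns one of the two labels. In the first E-step the place features have zero weight, so the soft labels come from the name characters alone; the place weights are only learned in the M-step, which is how unlabelled data can contribute context information in this sketch.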