A phonetic similarity model for automatic extraction of transliteration pairs

Authors:
Jin-Shea Kuo; Haizhou Li; Ying-Kuei Yang
Affiliations:
National Taiwan University of Science and Technology, Taipei, Taiwan; Institute for Infocomm Research, Singapore; National Taiwan University of Science and Technology, Taiwan
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2007

Abstract

This article proposes an approach for the automatic extraction of transliteration pairs from Chinese Web corpora. We formulate the machine transliteration process using a syllable-based phonetic similarity model that consists of phonetic confusion matrices and a Chinese character n-gram language model. With this model, the extraction of transliteration pairs becomes a two-step process of recognition followed by validation. In the recognition step, we identify the most probable transliteration in the k-neighborhood of a recognized English word. In the validation step, we qualify the transliteration pair candidates with a hypothesis test. To help formulate the phonetic similarity model, we carry out an analytical study of the statistics of several key factors in English-Chinese transliteration. We then conduct both supervised and unsupervised learning of the model on a development database. The experimental results validate the effectiveness of the phonetic similarity model, which achieves an F-measure of 0.739 with supervised learning. The unsupervised approach works almost as well as the supervised one, which allows us to deploy automatic extraction of transliteration pairs on the Web.
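
The two-step process described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the article's actual model: the probability tables are invented, pinyin strings stand in for Chinese characters, the k-neighborhood is simplified to the first k tokens following the English word, and a length-normalized score threshold stands in for the hypothesis test. The function names (`similarity`, `recognize`, `validate`) are hypothetical.

```python
import math

# Hypothetical confusion table: P(Chinese syllable | English syllable).
CONFUSION = {
    ("ku", "ku"): 0.6, ("ku", "ke"): 0.3,
    ("lo", "luo"): 0.7, ("lo", "le"): 0.2,
}
# Hypothetical bigram language model over Chinese tokens (pinyin stand-ins).
BIGRAM = {("<s>", "ku"): 0.2, ("ku", "luo"): 0.3}
FLOOR = 1e-4  # assumed smoothing probability for unseen events


def similarity(en_syllables, zh_tokens):
    """Log score of a candidate: confusion terms plus bigram LM terms."""
    if len(en_syllables) != len(zh_tokens):
        return float("-inf")
    score, prev = 0.0, "<s>"
    for en, zh in zip(en_syllables, zh_tokens):
        score += math.log(CONFUSION.get((en, zh), FLOOR))
        score += math.log(BIGRAM.get((prev, zh), FLOOR))
        prev = zh
    return score


def recognize(en_syllables, zh_context, k):
    """Recognition: best-scoring substring within the first k context tokens."""
    n = len(en_syllables)
    window = zh_context[:k]
    best, best_score = None, float("-inf")
    for start in range(len(window) - n + 1):
        candidate = window[start:start + n]
        score = similarity(en_syllables, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def validate(en_syllables, score, threshold=-4.0):
    """Validation stand-in: accept if the per-syllable log score clears a threshold."""
    return score / len(en_syllables) >= threshold


if __name__ == "__main__":
    english = ["ku", "lo"]
    best, score = recognize(english, ["de", "ku", "luo", "shi"], k=4)
    print(best, round(score, 3), validate(english, score))
```

In this sketch the threshold is arbitrary; a real system would need to estimate it from data, for example on a development set.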