Constructing transliteration lexicons from web corpora

Authors:
Jin-Shea Kuo;Ying-Kuei Yang
Affiliations:
Chung-Hwa Telecommunication Laboratories, Taiwan, R. O. C. and National Taiwan University of Science and Technology, Taiwan, R. O. C.;National Taiwan University of Science and Technology, Taiwan, R. O. C.
Venue:
ACLdemo '04 Proceedings of the ACL 2004 on Interactive poster and demonstration sessions
Year:
2004

Abstract

This paper proposes a novel approach to automating the construction of transliterated-term lexicons. A simple syllable alignment algorithm is used to construct confusion matrices for cross-language syllable-phoneme conversion. Each row in the confusion matrix consists of a set of syllables in the source language that are (correctly or erroneously) matched phonetically and statistically to a syllable in the target language. Two conversions using phoneme-to-phoneme and text-to-phoneme syllabification algorithms are automatically deduced from a training corpus of paired terms and are used to calculate the degree of similarity between phonemes for transliterated-term extraction. In a large-scale experiment using this automated learning process for conversions, more than 200,000 transliterated-term pairs were successfully extracted by analyzing query results from Internet search engines. Experimental results indicate the proposed approach shows promise in transliterated-term extraction.
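The confusion-matrix idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the training terms are already syllabified and aligned position by position, whereas the paper derives that alignment with its own algorithm, which the abstract does not specify. The function names, the averaging used for the similarity score, and the toy syllables are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def build_confusion_matrix(aligned_pairs):
    """Build a row-normalised confusion matrix from aligned term pairs.

    aligned_pairs: iterable of (source_syllables, target_syllables).
    Assumption: syllables are aligned one-to-one by position; pairs of
    unequal length are skipped rather than aligned.
    """
    counts = defaultdict(Counter)
    for source_syllables, target_syllables in aligned_pairs:
        if len(source_syllables) != len(target_syllables):
            continue
        for src, tgt in zip(source_syllables, target_syllables):
            counts[tgt][src] += 1

    # One row per target syllable: the source syllables matched to it,
    # with relative frequencies that sum to 1.
    matrix = {}
    for tgt, row in counts.items():
        total = sum(row.values())
        matrix[tgt] = {src: n / total for src, n in row.items()}
    return matrix


def similarity(matrix, source_syllables, target_syllables):
    """Average matrix weight over aligned syllables (hypothetical scoring)."""
    if not source_syllables or len(source_syllables) != len(target_syllables):
        return 0.0
    score = sum(
        matrix.get(tgt, {}).get(src, 0.0)
        for src, tgt in zip(source_syllables, target_syllables)
    )
    return score / len(source_syllables)


# Made-up training data, for illustration only.
TRAINING_PAIRS = [
    (["to", "ny"], ["tuo", "ni"]),
    (["to", "ni"], ["tuo", "ni"]),
    (["da", "ni"], ["dan", "ni"]),
    (["to", "ma"], ["tuo", "ma"]),
]

MATRIX = build_confusion_matrix(TRAINING_PAIRS)

if __name__ == "__main__":
    print(MATRIX["ni"])
    print(similarity(MATRIX, ["to", "ni"], ["tuo", "ni"]))
    print(similarity(MATRIX, ["ma", "to"], ["tuo", "ni"]))
```

In this sketch, a candidate pair whose syllables were frequently matched in training scores close to 1, while an unrelated pair scores 0. An extraction step could then keep only the candidates above a chosen threshold.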