Constructing transliteration lexicons from web corpora

  • Authors:
  • Jin-Shea Kuo; Ying-Kuei Yang

  • Affiliations:
  • Chung-Hwa Telecommunication Laboratories, Taiwan, R. O. C. and National Taiwan University of Science and Technology, Taiwan, R. O. C.; National Taiwan University of Science and Technology, Taiwan, R. O. C.

  • Venue:
  • ACLdemo '04 Proceedings of the ACL 2004 on Interactive poster and demonstration sessions
  • Year:
  • 2004

Abstract

This paper proposes a novel approach to automating the construction of transliterated-term lexicons. A simple syllable alignment algorithm is used to construct confusion matrices for cross-language syllable-phoneme conversion. Each row in the confusion matrix consists of a set of syllables in the source language that are (correctly or erroneously) matched phonetically and statistically to a syllable in the target language. Two conversions using phoneme-to-phoneme and text-to-phoneme syllabification algorithms are automatically deduced from a training corpus of paired terms and are used to calculate the degree of similarity between phonemes for transliterated-term extraction. In a large-scale experiment using this automated learning process for conversions, more than 200,000 transliterated-term pairs were successfully extracted by analyzing query results from Internet search engines. Experimental results indicate the proposed approach shows promise in transliterated-term extraction.
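
As a rough illustration of the confusion-matrix idea described in the abstract, the following minimal sketch counts how often each source-language syllable is matched to a target-language syllable and then scores a candidate term pair. It is not the authors' implementation: the abstract does not specify the alignment algorithm, so the sketch assumes the training pairs are already syllabified and aligned one-to-one. The function names, the `floor` smoothing value, the geometric-mean score, and the toy syllables are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict


def build_confusion_matrix(aligned_pairs):
    """Build rows of P(source syllable | target syllable).

    aligned_pairs: iterable of (source_syllables, target_syllables),
    assumed to be pre-aligned lists of equal length.
    """
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        if len(src) != len(tgt):
            continue  # assumption: skip pairs without a one-to-one alignment
        for s, t in zip(src, tgt):
            counts[t][s] += 1

    # Normalise each row so its entries sum to 1.
    matrix = {}
    for t, row in counts.items():
        total = sum(row.values())
        matrix[t] = {s: c / total for s, c in row.items()}
    return matrix


def similarity(matrix, src, tgt, floor=1e-6):
    """Geometric mean of per-syllable match probabilities (hypothetical score)."""
    if not src or len(src) != len(tgt):
        return 0.0
    score = 1.0
    for s, t in zip(src, tgt):
        # Unseen syllable pairs get a small floor instead of zero.
        score *= matrix.get(t, {}).get(s, floor)
    return score ** (1.0 / len(src))


# Made-up training data, not taken from the paper.
TOY_PAIRS = [
    (["ka", "ta"], ["ka", "da"]),
    (["ka", "da"], ["ka", "da"]),
    (["ga", "ta"], ["ka", "ta"]),
]

if __name__ == "__main__":
    m = build_confusion_matrix(TOY_PAIRS)
    print(m["ka"])
    print(similarity(m, ["ka", "ta"], ["ka", "ta"]))
    print(similarity(m, ["ga", "da"], ["ka", "da"]))
```

Each row of the resulting dictionary corresponds to one target-language syllable and lists the source-language syllables observed against it, whether the match was correct or erroneous. A candidate pair whose score exceeds some threshold could then be kept as a transliterated-term pair; choosing that threshold is outside the scope of this sketch.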