Measuring similarity between transliterations against noise data

  • Authors:
  • Chung-Chian Hsu; Chien-Hsing Chen; Tien-Teng Shih; Chun-Kai Chen

  • Affiliations:
  • National Yunlin University of Science and Technology, Taiwan, R.O.C. (all authors)

  • Venue:
  • ACM Transactions on Asian Language Information Processing (TALIP)
  • Year:
  • 2007

Abstract

When editors of newspapers and magazines translate proper nouns from foreign languages into Chinese, the Chinese translations they choose (termed transliterations) are typically phonetically similar to the original word. Because many translators work without a common standard, the same proper noun may have many different Chinese transliterations: some use the same sounds but different Chinese characters, and others use different sounds and characters altogether. This confuses readers and, more importantly, leads to incomplete Chinese Web search results. This article investigates the similarity comparison of transliterations as a first step toward solving the incomplete search problem. We devise a method based on comparing digitalized sounds of Chinese characters (Hanzi). We compare this method with four other methods, based on grapheme or phoneme similarity, in terms of their performance in identifying synonymous transliterations against noise words taken from Web pages. Experimental results indicate that our method surpasses the other methods because its sound vectors contain more discriminative information. The second-best method is based on a scheme that assigns similarity between phonemes by carefully considering their articulatory features, including using multivalued features and placing different weights on the features. Among the six pinyin schemes used to romanize Chinese transliterations, the Tongyong scheme outperforms the others.
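
To make the idea of a grapheme-based comparison concrete, the following is a minimal sketch, not the authors' method and not any of the specific methods evaluated in the article. It assumes that transliterations have already been romanized into plain pinyin strings without tone marks, and it scores two strings by normalized edit distance. The function names and the example strings ("buxi" and "bushi" as romanizations of two common transliterations of "Bush", and "beijing" as a noise word) are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (unit costs)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def grapheme_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    # Hypothetical romanized inputs, chosen only for illustration.
    query = "buxi"
    for candidate in ["bushi", "beijing"]:
        print(candidate, round(grapheme_similarity(query, candidate), 3))
```

A purely letter-based score like this treats every substitution as equally costly, so it cannot tell that two letters stand for similar sounds. That limitation is the kind of gap that phoneme-feature and sound-based comparisons, such as those described in the abstract, aim to close.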