Transliteration equivalence using canonical correlation analysis

  • Authors: Raghavendra Udupa; Mitesh M. Khapra

  • Affiliations: Microsoft Research, India; Indian Institute of Technology, Bombay

  • Venue: ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
  • Year: 2010

Abstract

We address the problem of Transliteration Equivalence, i.e., determining whether a pair of words in two different languages (e.g., Auden, ऑडेन) are name transliterations of each other. This problem is at the heart of Mining Name Transliterations (MINT) from various sources of multilingual text data, including parallel, comparable, and non-comparable corpora and multilingual news streams. MINT is useful in several cross-language tasks, including Cross-Language Information Retrieval (CLIR), Machine Translation (MT), and Cross-Language Named Entity Retrieval. We propose a novel approach to Transliteration Equivalence using language-neutral representations of names. The key idea is to treat name transliterations in two languages as two views of the same semantic object and to compute a low-dimensional common feature space using Canonical Correlation Analysis (CCA). Similarity of the names in the common feature space forms the basis for classifying a pair of names as transliterations. We show that our approach outperforms state-of-the-art baselines in the CLIR task for Hindi-English (3 collections) and Tamil-English (2 collections).
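
The two-view idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Several choices here are assumptions made only for illustration: character-bigram count features, ridge-regularised CCA computed with NumPy, cosine similarity in the projected space, a six-pair toy training list, and an arbitrary 0.5 decision threshold. None of these values come from the abstract.

```python
import numpy as np

def char_ngrams(name, n=2):
    """Character n-grams with boundary markers (feature choice is an assumption)."""
    s = "^" + name + "$"
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_vocab(names, n=2):
    """Map each n-gram seen in the training names to a column index."""
    vocab = {}
    for name in names:
        for g in char_ngrams(name, n):
            vocab.setdefault(g, len(vocab))
    return vocab

def featurize(names, vocab, n=2):
    """Bag-of-n-grams count matrix; n-grams outside the vocabulary are ignored."""
    X = np.zeros((len(names), len(vocab)))
    for r, name in enumerate(names):
        for g in char_ngrams(name, n):
            if g in vocab:
                X[r, vocab[g]] += 1.0
    return X

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(C)
    return (V / np.sqrt(w)) @ V.T

def fit_cca(X, Y, k=2, reg=1e-2):
    """Regularised CCA: whiten both views, then take the SVD of the cross-covariance."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    m = X.shape[0]
    Cxx = Xc.T @ Xc / m + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / m + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / m
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy, full_matrices=False)
    return {"mx": mx, "my": my,
            "A": Wx @ U[:, :k],      # projection for view 1
            "B": Wy @ Vt[:k].T,      # projection for view 2
            "corr": s[:k]}           # top-k canonical correlations

def similarity(model, x, y):
    """Cosine similarity of two feature vectors in the common space."""
    p = (x - model["mx"]) @ model["A"]
    q = (y - model["my"]) @ model["B"]
    denom = np.linalg.norm(p) * np.linalg.norm(q)
    return float(p @ q / denom) if denom > 0 else 0.0

# Toy paired names (illustrative data only, far too small for real use).
english = ["auden", "london", "delhi", "mumbai", "gandhi", "nehru"]
hindi = ["ऑडेन", "लंदन", "दिल्ली", "मुंबई", "गांधी", "नेहरू"]

vx, vy = build_vocab(english), build_vocab(hindi)
X, Y = featurize(english, vx), featurize(hindi, vy)
model = fit_cca(X, Y, k=2)

score = similarity(model, X[0], Y[0])
is_match = score >= 0.5  # hypothetical threshold, not from the paper
print("auden vs ऑडेन:", round(score, 3), is_match)
```

With so few training pairs the projections overfit, so the printed score only shows the mechanics. A realistic setup would train on many known transliteration pairs and tune the dimensionality, the regularisation, and the threshold on held-out data.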