Named entity transliteration for cross-language information retrieval using compressed word format mapping algorithm

  • Authors:
  • Srinivasan C. Janarthanam;Sethuramalingam Subramaniam;Udhyakumar Nallasamy

  • Affiliations:
  • University of Edinburgh, Edinburgh, United Kingdom;International Institute of Information Technology (IIIT-H), Hyderabad, India;Carnegie Mellon University, Pittsburgh, PA, USA

  • Venue:
  • Proceedings of the 2nd ACM workshop on Improving non english web searching
  • Year:
  • 2008

Abstract

Transliteration of named entities in user queries is a vital step in any Cross-Language Information Retrieval (CLIR) system. Several transliteration methods have been proposed to date, each shaped by the nature of the languages considered. In this paper, we present a transliteration algorithm for mapping English named entities to their proper Tamil equivalents. Our algorithm employs a grapheme-based model in which transliteration equivalents are identified by mapping source-language names to their equivalents in a target-language database, instead of generating them. The basic principle is to compress the source word into its minimal form and align it against an indexed list of target-language words to arrive at the top-n equivalents, ranked by edit distance. We compare the performance of our approach with a statistical generation approach using the Microsoft Research India (MSRI) transliteration corpus. Our approach has proved very effective in terms of accuracy and time.
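The compress-then-match idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the compression rules, so the sketch assumes a simple consonant skeleton (drop vowels, collapse repeated letters). It also assumes a hypothetical target database of (romanized form, Tamil form) pairs, with the romanized form used to build the index. The names `compress`, `build_index`, and `top_n` are illustrative only.

```python
from collections import defaultdict

VOWELS = set("aeiou")  # assumption: vowels carry the least matching signal


def compress(word):
    """Assumed compression: lowercase, drop vowels and non-letters,
    then collapse consecutive duplicates in the remaining skeleton."""
    out = []
    for ch in word.lower():
        if not ch.isalpha() or ch in VOWELS:
            continue
        if out and out[-1] == ch:
            continue
        out.append(ch)
    return "".join(out)


def edit_distance(a, b):
    """Standard Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_index(pairs):
    """Index target words by the compressed form of their romanization.
    `pairs` is a hypothetical list of (romanized, tamil) tuples."""
    index = defaultdict(list)
    for romanized, tamil in pairs:
        index[compress(romanized)].append(tamil)
    return index


def top_n(source, index, n=3):
    """Return the n target words whose compressed keys are closest
    (by edit distance) to the compressed source word."""
    key = compress(source)
    scored = []
    for target_key, words in index.items():
        distance = edit_distance(key, target_key)
        scored.extend((distance, word) for word in words)
    scored.sort()
    return [word for _, word in scored[:n]]


# Toy database, for illustration only.
index = build_index([
    ("chennai", "சென்னை"),
    ("madurai", "மதுரை"),
    ("kovai", "கோவை"),
])
print(top_n("Chenai", index, n=1))
```

Here the misspelled query "Chenai" and the database entry "chennai" both compress to the same key, so the match has distance zero. Because candidates are retrieved rather than generated, every result is guaranteed to be a real entry in the target database. This sketch scans every key linearly; a real system would need a faster lookup over the index.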