Handling OOV words in Indian-language – English CLIR

  • Authors: Parin Chheda; Manaal Faruqui; Pabitra Mitra

  • Affiliations: Computer Science and Engineering, Indian Institute of Technology Kharagpur, India (all authors)

  • Venue: ECIR'12: Proceedings of the 34th European Conference on Advances in Information Retrieval
  • Year: 2012

Abstract

Cross-lingual information retrieval (CLIR) is a difficult task for many Indian languages because of the lack of resources. Google Translate provides an easy way to translate from Indian languages to English, but because of lexicon limitations most out-of-vocabulary (OOV) words are transliterated letter by letter along with their suffixes, resulting in an unusually long string. The resulting string often does not match its intended translation, which hurts retrieval. We propose an approach to extract the correct word from such strings using word segmentation along with approximate string matching based on the Soundex algorithm and Levenshtein distance. We evaluate our approach across three Indian languages and find an average improvement of 5.8% in MAP on the FIRE-2010 dataset.
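
The abstract does not spell out how segmentation and matching are combined, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that the suffix trails the stem (so candidate segments are prefixes of the transliterated string), that an English word list is available, and that a candidate is accepted when its Soundex code equals the segment's code and the Levenshtein distance is small. The function names, the `min_len` and `max_dist` parameters, and the scoring rule (word length minus edit distance) are all hypothetical choices made for illustration.

```python
from typing import Iterable, Optional


def soundex(word: str) -> str:
    """Classic American Soundex: first letter followed by three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def extract_word(token: str, vocabulary: Iterable[str],
                 min_len: int = 3, max_dist: int = 2) -> Optional[str]:
    """Return the vocabulary word best matching some prefix of token.

    Assumption (not from the paper): score = len(word) - distance, so a
    long word with one edit beats a short word that matches exactly.
    """
    vocab = list(vocabulary)
    best = None  # (score, -distance, word); larger is better
    for end in range(min_len, len(token) + 1):
        segment = token[:end].lower()
        seg_code = soundex(segment)
        for word in vocab:
            if soundex(word) != seg_code:
                continue  # phonetic filter first, edit distance second
            dist = levenshtein(segment, word.lower())
            if dist > max_dist:
                continue
            key = (len(word) - dist, -dist, word)
            if best is None or key > best:
                best = key
    return best[2] if best else None


if __name__ == "__main__":
    # Hypothetical transliteration of a plural, suffix-bearing form.
    print(extract_word("ministaron", ["mini", "monster", "minister"]))
```

With the hypothetical input above, the prefix "ministar" shares the Soundex code M523 with "minister" and is one edit away, so "minister" is returned. Because Soundex keeps the first letter unchanged, a sketch like this would miss pairs that differ in their initial letter (for example a transliteration starting with "k" for an English word starting with "c"); handling such cases would need an additional rule that the abstract does not describe.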