Handling OOV Words in Indian-Language - English CLIR

Authors:
Parin Chheda; Manaal Faruqui; Pabitra Mitra
Affiliations:
Computer Science and Engineering, Indian Institute of Technology Kharagpur, India; Computer Science and Engineering, Indian Institute of Technology Kharagpur, India; Computer Science and Engineering, Indian Institute of Technology Kharagpur, India
Venue:
ECIR '12: Proceedings of the 34th European Conference on Advances in Information Retrieval
Year:
2012

Abstract

Cross-lingual information retrieval (CLIR) is a difficult task for many Indian languages because of a lack of resources. Google Translate provides an easy way to translate from Indian languages to English, but because of lexicon limitations most out-of-vocabulary (OOV) words are transliterated letter by letter along with their suffixes, resulting in an unusually long string. The resulting string often does not match its intended translation, which hurts retrieval. We propose an approach that extracts the correct word from such strings using word segmentation together with approximate string matching based on the Soundex algorithm and Levenshtein distance. We evaluate our approach on three Indian languages and find an average improvement of 5.8% in MAP on the FIRE-2010 dataset.
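
The abstract names the building blocks but not how they are combined, so the following is only an illustrative sketch and not the authors' implementation. It assumes a list of English terms is available (here called `vocabulary`), treats segmentation as trying progressively shorter prefixes of the transliterated string, uses the Soundex code as a phonetic filter, and ranks the surviving candidates by Levenshtein distance. The function name `best_match`, the parameters `min_len` and `max_dist`, and the sample strings are all hypothetical.

```python
from typing import Iterable, Optional


def soundex(word: str) -> str:
    """Classic American Soundex: first letter followed by three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = "".join(ch for ch in word.lower() if ch.isascii() and ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def best_match(token: str, vocabulary: Iterable[str],
               min_len: int = 3, max_dist: int = 2) -> Optional[str]:
    """Hypothetical helper: strip a trailing suffix by trying shorter
    prefixes, keep vocabulary words with the same Soundex code, and
    return the one with the smallest edit distance (or None)."""
    token = token.lower()
    coded = [(w, soundex(w)) for w in vocabulary]
    best = None  # (distance, -prefix_length, word); smaller is better
    for end in range(len(token), min_len - 1, -1):
        prefix = token[:end]
        code = soundex(prefix)
        for word, word_code in coded:
            if word_code != code:
                continue
            dist = levenshtein(prefix, word.lower())
            if dist <= max_dist:
                cand = (dist, -end, word)
                if best is None or cand < best:
                    best = cand
    return best[2] if best else None


if __name__ == "__main__":
    # Made-up transliteration-like string: an English word plus a suffix.
    print(best_match("hospitalom", ["hospital", "hostel", "hospice"]))
```

One limitation of this sketch is that Soundex keeps the first letter verbatim, so a transliteration that starts with a different letter than the English spelling (for example "k" for "c") would be filtered out before the edit distance is ever computed. A real system would need some way to handle that case, which the abstract does not describe.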