Textual representations for corpus-based bilingual retrieval

Authors:
Charles K. Nicholas; Paul McNamee
Affiliations:
University of Maryland, Baltimore County; University of Maryland, Baltimore County
Venue:
Textual representations for corpus-based bilingual retrieval
Year:
2008

Abstract

The traditional approach to information retrieval uses words as the indexing and search terms for documents. However, word-based representations have difficulty addressing morphological processes that confound retrieval, such as inflection, derivation, and compounding. One part of this research investigates alternative methods for representing text, including n-gram tokenization, a method based on overlapping sequences of characters. N-grams are studied in depth, and one notable finding is that they achieve a 20% improvement in retrieval effectiveness over words in certain situations.

The other focus of this research is improving retrieval performance when foreign-language documents must be searched and translation is required. In this scenario bilingual dictionaries are often used to translate user queries; however, even among the most commonly spoken languages, for which large bilingual lexicons exist, dictionary-based translation suffers from several significant problems. These include difficulty handling proper names, which are often missing; issues related to morphological variation, since entries or query terms may not be lemmatized; and an inability to robustly handle multiword phrases, especially non-compositional expressions. These problems can be addressed when translation is accomplished using parallel collections, that is, sets of documents available in more than one language. Using parallel texts enables statistical translation of character n-grams rather than words or stemmed words, and with this technique highly effective bilingual retrieval performance is obtained.

In this dissertation I present an overview of the field of cross-language information retrieval and then introduce the foundational concepts in n-gram tokenization and corpus-based translation. Monolingual and bilingual experiments on test sets in 13 languages are then described. Analysis of these experiments gives insight into the relative efficacy of various tokenization methods; reasons for the effectiveness of n-grams; the utility of automated relevance feedback, in both monolingual and bilingual contexts; the interplay between tokenization and translation; and how translation resource selection and size influence bilingual retrieval.
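
The abstract describes n-gram tokenization only as "overlapping sequences of characters", so the following is a minimal sketch of that idea and not the dissertation's implementation. The function name `char_ngrams`, the default length `n=4`, the lowercasing, and the padding with a single space at each end are all assumptions made for illustration. The second function shows one plausible reason n-grams tolerate morphological variation: inflected or derived forms of a word still share several n-grams.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of `text`.

    Assumptions (not taken from the source): text is lowercased,
    runs of whitespace are collapsed to one space, and a single
    space is added at each end so word boundaries appear in n-grams.
    """
    words = text.lower().split()
    if not words:
        return []
    padded = " " + " ".join(words) + " "
    if len(padded) < n:
        # Text shorter than n: keep it as a single (short) term.
        return [padded]
    # Slide a window of width n one character at a time.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def shared_ngrams(a: str, b: str, n: int = 4) -> set[str]:
    """Return the n-grams that two strings have in common."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))


if __name__ == "__main__":
    print(char_ngrams("retrieval"))
    print(sorted(shared_ngrams("juggling", "juggler")))
```

With word-based indexing, "juggling" and "juggler" are unrelated terms unless a stemmer conflates them. Under this sketch they share the n-grams built from their common prefix, so a query containing one form can still partially match a document containing the other.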