Managing misspelled queries in IR applications

  • Authors:
  • Jesús Vilares; Manuel Vilares; Juan Otero

  • Affiliations:
  • Department of Computer Science, University of A Coruña, Campus de Elviña, 15071 A Coruña, Spain; Department of Computer Science, University of Vigo, Campus As Lagoas s/n, 32004 Ourense, Spain; Department of Computer Science, University of Vigo, Campus As Lagoas s/n, 32004 Ourense, Spain

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2011

Abstract

Our work concerns the design of robust information retrieval environments that can successfully handle queries containing misspelled words. Our aim is to perform a comparative analysis of the efficacy of two possible strategies that can be adopted. A first strategy involves those approaches based on correcting the misspelled query, thus requiring the integration of linguistic information in the system. This solution has been studied from complementary standpoints, according to whether contextual information of a linguistic nature is integrated in the process or not, the former implying a higher degree of complexity. A second strategy involves the use of character n-grams as the basic indexing unit, which guarantees the robustness of the information retrieval process whilst at the same time eliminating the need for a specific query correction stage. This is a knowledge-light and language-independent solution which requires no linguistic information for its application. Both strategies have been subjected to experimental testing, with Spanish being used as the case in point. This is a language which, unlike English, has a great variety of morphological processes, making it particularly sensitive to spelling errors. The results obtained demonstrate that stemming-based approaches are highly sensitive to misspelled queries, particularly with short queries. However, such a negative impact can be effectively reduced by the use of correction mechanisms during querying, particularly in the case of context-based correction, since more classical approaches introduce too much noise when query length is increased. On the other hand, our n-gram based strategy shows a remarkable robustness, with average performance losses appreciably smaller than those for stemming.
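To illustrate why character n-grams tolerate spelling errors, the following is a minimal sketch rather than the authors' implementation. The n-gram length (n = 4), the whitespace tokenization, the helper names `char_ngrams` and `dice_overlap`, and the use of the Dice coefficient as an overlap measure are all assumptions made for this example; the abstract specifies none of them.

```python
def char_ngrams(text, n=4):
    """Split each whitespace-delimited word into overlapping character n-grams.

    Words no longer than n characters are kept whole.
    The value n=4 is an assumption for illustration only.
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def dice_overlap(a, b, n=4):
    """Dice coefficient between the n-gram sets of two strings.

    Used here only to show how much indexing material survives a typo;
    it is not the ranking function of any particular retrieval engine.
    """
    ga, gb = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


# A two-word Spanish query and a variant with one substituted character.
correct = "documentos relevantes"
misspelled = "documentos relevamtes"

shared = set(char_ngrams(correct)) & set(char_ngrams(misspelled))
print(sorted(shared))
print(round(dice_overlap(correct, misspelled), 3))
```

With whole words as index terms, the misspelled word matches nothing, so half of this query is lost. With 4-grams, the error only affects the n-grams that span the wrong character, and the remaining n-grams of the same word still match without any correction stage.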