Managing misspelled queries in IR applications

Authors:
Jesús Vilares;Manuel Vilares;Juan Otero
Affiliations:
Department of Computer Science, University of A Coruña Campus de Elviña, 15071 A Coruña, Spain;Department of Computer Science, University of Vigo Campus As Lagoas s/n, 32004 Ourense, Spain;Department of Computer Science, University of Vigo Campus As Lagoas s/n, 32004 Ourense, Spain
Venue:
Information Processing and Management: an International Journal
Year:
2011

Abstract

Our work concerns the design of robust information retrieval environments that can successfully handle queries containing misspelled words. Our aim is to compare the efficacy of two possible strategies. The first strategy corrects the misspelled query, which requires integrating linguistic information into the system. This solution has been studied from two complementary standpoints, according to whether or not contextual information of a linguistic nature is used in the correction process; using context implies a higher degree of complexity. The second strategy uses character n-grams as the basic indexing unit, which guarantees the robustness of the information retrieval process while eliminating the need for a specific query correction stage. This is a knowledge-light and language-independent solution that requires no linguistic information for its application. Both strategies have been tested experimentally, using Spanish as the case in point. Unlike English, Spanish has a great variety of morphological processes, which makes it particularly sensitive to spelling errors. The results show that stemming-based approaches are highly sensitive to misspelled queries, particularly when the queries are short. However, this negative impact can be effectively reduced by applying correction mechanisms at query time, particularly context-based correction, since more classical approaches introduce too much noise as query length increases. In contrast, our n-gram-based strategy shows remarkable robustness, with average performance losses appreciably smaller than those for stemming.
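
The intuition behind the second strategy can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the example words, and the n-gram length of 4 are all assumptions chosen for illustration. The sketch splits a word into overlapping character n-grams and shows that a misspelled query term still shares some indexing units with the correct form, whereas a whole-word comparison fails outright.

```python
def char_ngrams(word, n=4):
    """Return the overlapping character n-grams of a word.

    Words no longer than n are kept whole. The default n=4 is an
    assumption for this sketch, not a value taken from the abstract.
    """
    word = word.lower()
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shared_ngrams(a, b, n=4):
    """Return the set of character n-grams that two words have in common."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))


# Hypothetical example: a Spanish term and a one-letter misspelling of it.
indexed_term = "documento"
misspelled_query = "docunento"

# Whole-word matching finds nothing in common.
exact_match = indexed_term == misspelled_query

# n-gram matching still finds the units untouched by the error.
common = shared_ngrams(indexed_term, misspelled_query)

print(exact_match)      # False
print(sorted(common))   # ['docu', 'ento']
```

Because a single character error only damages the n-grams that overlap it, the remaining n-grams can still match indexed text without any dictionary or correction step, which is why the approach needs no language-specific resources.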