Query-based text normalization selection models for enhanced retrieval accuracy

Authors:
Si-Chi Chin;Rhonda DeCook;W. Nick Street;David Eichmann
Affiliations:
The University of Iowa, Iowa City;The University of Iowa, Iowa City;The University of Iowa, Iowa City;The University of Iowa, Iowa City
Venue:
SS '10 Proceedings of the NAACL HLT 2010 Workshop on Semantic Search
Year:
2010

Abstract

Text normalization transforms words into a base form so that terms from the same equivalence class match. Traditionally, information retrieval systems employ stemming techniques to remove derivational affixes. Depluralization, the transformation of plurals into singular forms, is also used as a low-level text normalization technique that preserves more precise lexical semantics of the text. Experimental results suggest that the choice of text normalization technique should be made individually for each topic to enhance information retrieval accuracy. This paper proposes a hybrid approach: a query-based selection model that chooses the appropriate text normalization technique (stemming, depluralization, or no normalization) for each query. The selection model uses ambiguity properties extracted from queries to train a composite of Support Vector Regression (SVR) models that predicts which text normalization technique yields the highest Mean Average Precision (MAP). Based on our study, such a selection model holds promise for improving retrieval accuracy.
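
The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: one regressor per normalization technique predicts MAP from query features, and the technique with the highest prediction is chosen. To stay dependency-free, a nearest-neighbor regressor stands in for SVR. The two-element feature vectors, the function names, and all numbers are hypothetical and invented for illustration only.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Features = Sequence[float]
Regressor = Callable[[Features], float]


def fit_nearest_neighbor(samples: List[Tuple[Features, float]]) -> Regressor:
    """Stand-in for an SVR model (assumption): predict the MAP observed
    for the training query whose features are closest to the input."""
    def predict(x: Features) -> float:
        def sq_dist(sample: Tuple[Features, float]) -> float:
            feats, _ = sample
            return sum((a - b) ** 2 for a, b in zip(feats, x))
        _, target = min(samples, key=sq_dist)
        return target
    return predict


def select_normalization(models: Dict[str, Regressor], x: Features) -> str:
    """Return the technique whose model predicts the highest MAP for x."""
    return max(models, key=lambda name: models[name](x))


# Toy training data, invented for illustration (not results from the paper).
# Each row pairs hypothetical query ambiguity features with a MAP value.
training = {
    "stemming": [((1.0, 2.0), 0.30), ((4.0, 6.0), 0.10)],
    "depluralization": [((1.0, 2.0), 0.25), ((4.0, 6.0), 0.20)],
    "none": [((1.0, 2.0), 0.20), ((4.0, 6.0), 0.15)],
}

# One regressor per technique: the "composite" of models.
models = {name: fit_nearest_neighbor(rows) for name, rows in training.items()}

print(select_normalization(models, (1.2, 2.1)))
print(select_normalization(models, (3.8, 5.9)))
```

Because `fit_nearest_neighbor` only has to return a callable, an actual SVR implementation could be swapped in without changing `select_normalization`.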