A goodness of fit test approach in information retrieval

Authors:
Kostas Fragos;Yannis Maistros
Affiliations:
Department of Electrical and Computer Engineers, National Technical University of Athens Iroon, Zografou, Greece 15780;Department of Electrical and Computer Engineers, National Technical University of Athens Iroon, Zografou, Greece 15780
Venue:
Information Retrieval
Year:
2006

Citing 10
Cited 0

Goodness-of-fit techniques

Goodness-of-fit techniques
Evaluation of an inference network-based retrieval model

ACM Transactions on Information Systems (TOIS) - Special issue on research and development in information retrieval
A language modeling approach to information retrieval

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
A hidden Markov model information retrieval system

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Information retrieval as statistical translation

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
A study of smoothing methods for language models applied to Ad Hoc information retrieval

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
A method based on the chi-square test for document classification

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Probabilistic models of information retrieval based on measuring the divergence from randomness

ACM Transactions on Information Systems (TOIS)
Probabilistic models of indexing and searching

SIGIR '80 Proceedings of the 3rd annual ACM conference on Research and development in information retrieval
The SMART Retrieval System—Experiments in Automatic Document Processing

The SMART Retrieval System—Experiments in Automatic Document Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

In many probabilistic modeling approaches to Information Retrieval we are interested in estimating how well a document model "fits" the user's information need (query model). On the other hand in statistics, goodness of fit tests are well established techniques for assessing the assumptions about the underlying distribution of a data set. Supposing that the query terms are randomly distributed in the various documents of the collection, we actually want to know whether the occurrences of the query terms are more frequently distributed by chance in a particular document. This can be quantified by the so-called goodness of fit tests. In this paper, we present a new document ranking technique based on Chi-square goodness of fit tests. Given the null hypothesis that there is no association between the query terms q and the document d irrespective of any chance occurrences, we perform a Chi-square goodness of fit test for assessing this hypothesis and calculate the corresponding Chi-square values. Our retrieval formula is based on ranking the documents in the collection according to these calculated Chi-square values. The method was evaluated over the entire test collection of TREC data, on disks 4 and 5, using the topics of TREC-7 and TREC-8 (50 topics each) conferences. It performs well, outperforming steadily the classical OKAPI term frequency weighting formula but below that of KL-Divergence from language modeling approach. Despite this, we believe that the technique is an important non-parametric way of thinking of retrieval, offering the possibility to try simple alternative retrieval formulas within goodness-of-fit statistical tests' framework, modeling the data in various ways estimating or assigning any arbitrary theoretical distribution in terms.