Empirical term weighting and expansion frequency

Authors:
Kyoji Umemura;Kenneth W. Church
Affiliations:
Toyohashi University of Technology, Toyohashi Aichi, Japan;AT&T Labs-Research, Florham Park, NJ
Venue:
EMNLP '00 Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13
Year:
2000

Citing 3
Cited 2

A new method of weighting query terms for ad-hoc retrieval

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Distribution of content words and phrases in text and language modelling

Natural Language Engineering
Empirical estimates of adaptation: the chance of two noriegas is closer to p/2 than p2

COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1

Selecting indexing strings using adaptation

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
A Bayesian mixture model for term re-occurrence and burstiness

CONLL '05 Proceedings of the Ninth Conference on Computational Natural Language Learning

Quantified Score

Hi-index	0.00

Visualization

Abstract

We propose an empirical method for estimating term weights directly from relevance judgments, avoiding various standard but potentially trouble-some assumptions. It is common to assume, for example, that weights vary with term frequency (tf) and inverse document frequency (idf) in a particular way, e.g., tf .idf, but the fact that there are so many variants of this formula in the literature suggests that there remains considerable uncertainty about these assumptions. Our method is similar to the Berkeley regression method where labeled relevance judgments are fit as a linear combination of (transforms of) tf, idf, etc. Training methods not only improve performance, but also extend naturally to include additional factors such as burstiness and query expansion. The proposed histogram-based training method provides a simple way to model complicated interactions among factors such as tf, idf, burstiness and expansion frequency (a generalization of query expansion). The correct handling of expanded term is realized based on statistical information. Expansion frequency dramatically improves performance from a level comparable to BKJJBIDS, Berkeley's entry in the Japanese NACSIS NTCIR-1 evaluation for short queries, to the level of JCB1, the top system in the evaluation. JCB1 uses sophisticated (and proprietary) natural language processing techniques developed by Just System, a leader in the Japanese word-processing industry. We are encouraged that the proposed method, which is simple to understand and replicate, can reach this level of performance.