Some Formal Analysis of Rocchio's Similarity-Based Relevance Feedback Algorithm

Authors:
Zhixiang Chen; Binhai Zhu
Affiliations:
Department of Computer Science, University of Texas-Pan American, 1201 West University Drive, Edinburg, TX 78539, USA (chen@cs.panam.edu); Department of Computer Science, Montana State University, Bozeman, MT 59717, USA (bhz@cs.montana.edu)
Venue:
Information Retrieval
Year:
2002

Abstract

Rocchio's similarity-based relevance feedback algorithm, one of the most important query reformation methods in information retrieval, is essentially an adaptive supervised learning algorithm from examples. In spite of its popularity in various applications, there is little rigorous analysis of its learning complexity in the literature. In this paper we show that in the binary vector space model, if the initial query vector is 0, then for any of the four typical similarities (inner product, Dice coefficient, cosine coefficient, and Jaccard coefficient), Rocchio's similarity-based relevance feedback algorithm makes at least n mistakes when used to search for a collection of documents represented by a monotone disjunction of at most k relevant features (or terms) over the n-dimensional binary vector space {0, 1}^n. When an arbitrary initial query vector in {0, 1}^n is used, it makes at least (n + k − 3)/2 mistakes to search for the same collection of documents. The linear lower bounds are independent of the choices of the threshold and coefficients that the algorithm may use in updating its query vector and making its classification.
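
The setting described in the abstract can be illustrated with a small sketch. It is not the paper's construction or proof. It assumes the standard vector-space forms of the four similarities and a simple mistake-driven update (add a multiple of a missed relevant document, subtract a multiple of a wrongly retrieved one). The names `rocchio_mistakes`, `theta`, `alpha`, and `beta` are hypothetical and chosen only for illustration.

```python
from math import sqrt

def inner(q, x):
    """Inner product similarity."""
    return sum(a * b for a, b in zip(q, x))

def cosine(q, x):
    """Cosine coefficient; defined as 0 when either vector is all zeros."""
    denom = sqrt(inner(q, q)) * sqrt(inner(x, x))
    return inner(q, x) / denom if denom else 0.0

def dice(q, x):
    """Dice coefficient; defined as 0 when both vectors are all zeros."""
    denom = inner(q, q) + inner(x, x)
    return 2 * inner(q, x) / denom if denom else 0.0

def jaccard(q, x):
    """Jaccard coefficient; defined as 0 when both vectors are all zeros."""
    denom = inner(q, q) + inner(x, x) - inner(q, x)
    return inner(q, x) / denom if denom else 0.0

def target(x, relevant):
    """Monotone disjunction: x is relevant if it contains any relevant term."""
    return any(x[i] for i in relevant)

def rocchio_mistakes(docs, relevant, sim=inner, theta=0.5,
                     alpha=1.0, beta=1.0, q0=None):
    """Run an assumed mistake-driven Rocchio-style learner over docs.

    Returns (number of mistakes, final query vector).
    """
    q = list(q0) if q0 is not None else [0.0] * len(docs[0])
    mistakes = 0
    for x in docs:
        predicted = sim(q, x) >= theta      # classify by thresholded similarity
        actual = target(x, relevant)        # feedback from the user
        if predicted != actual:
            mistakes += 1
            step = alpha if actual else -beta
            q = [qi + step * xi for qi, xi in zip(q, x)]
    return mistakes, q

# Usage: n = 4 terms, target disjunction x1 OR x2 (indices 0 and 1).
docs = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
mistakes, q = rocchio_mistakes(docs, relevant={0, 1})
```

On this short, non-adversarial sequence the learner errs only on the two relevant unit vectors. The lower bounds in the abstract concern worst-case document sequences, which this sketch does not attempt to construct.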