Effective and efficient classification on a search-engine model

Authors:
Aris Anagnostopoulos;Andrei Broder;Kunal Punera
Affiliations:
Yahoo! Research, 701 First Ave, 94089, Sunnyvale, CA, USA;Yahoo! Research, 701 First Ave, 94089, Sunnyvale, CA, USA;University of Texas at Austin, Department of Electrical and Computer Engineering, 701 First Ave, 78712, Austin, TX, USA
Venue:
Knowledge and Information Systems
Year:
2008

Citing 19
Cited 2

C4.5: programs for machine learning

C4.5: programs for machine learning
Relevance feedback and inference networks

SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
The nature of statistical learning theory

The nature of statistical learning theory
On the Optimality of the Simple Bayesian Classifier under Zero-One Loss

Machine Learning - Special issue on learning with probabilistic representations
An algorithm for suffix stripping

Readings in information retrieval
A re-examination of text categorization methods

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Extracting query modifications from nonlinear SVMs

Proceedings of the 11th international conference on World Wide Web
Machine Learning

Machine Learning
Modern Information Retrieval

Modern Information Retrieval
On Bias, Variance, 0/1—Loss, and the Curse-of-Dimensionality

Data Mining and Knowledge Discovery
Enabling Concept-Based Relevance Feedback for Information Retrieval on the WWW

IEEE Transactions on Knowledge and Data Engineering
An evaluation of term dependence models in information retrieval

SIGIR '82 Proceedings of the 5th annual ACM conference on Research and development in information retrieval
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
A Comparative Study on Feature Selection in Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies

The VLDB Journal — The International Journal on Very Large Data Bases
Improving Category Specific Web Search by Learning Query Modifications

SAINT '01 Proceedings of the 2001 Symposium on Applications and the Internet (SAINT 2001)
Efficient query evaluation using a two-level retrieval process

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
In Defense of One-Vs-All Classification

The Journal of Machine Learning Research
RCV1: A New Benchmark Collection for Text Categorization Research

The Journal of Machine Learning Research

Integrating induction and deduction for noisy data mining

Information Sciences: an International Journal
A schema for ontology-based concept definition and identification

International Journal of Computer Applications in Technology

Quantified Score

Hi-index	0.00

Visualization

Abstract

Traditional document classification frameworks, which apply the learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora, like the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed primarily for the purpose of searching, and thus our access to documents is solely through the inverted index of a large scale search engine. Our main goal is to build the “best” short query that characterizes a document class using operators normally available within search engines. We show that surprisingly good classification accuracy can be achieved on average over multiple classes by queries with as few as 10 terms. As part of our study, we enhance some of the feature-selection techniques that are found in the literature by forcing the inclusion of terms that are negatively correlated with the target class and by making use of term correlations; we show that both of those techniques can offer significant advantages. Moreover, we show that optimizing the efficiency of query execution by careful selection of terms can further reduce the query costs. More precisely, we show that on our set-up the best 10-term query can achieve 93% of the accuracy of the best SVM classifier (14,000 terms), and if we are willing to tolerate a reduction to 89% of the best SVM, we can build a 10-term query that can be executed more than twice as fast as the best 10-term query.