Precomputing search features for fast and accurate query classification

Authors:
Venkatesh Ganti;Arnd Christian König;Xiao Li
Affiliations:
Microsoft Research, Redmond, USA;Microsoft Research, Redmond, USA;Microsoft Research, Redmond, USA
Venue:
Proceedings of the third ACM international conference on Web search and data mining
Year:
2010

Abstract

Query intent classification is crucial for web search and advertising. It is known to be challenging because web queries contain fewer than three words on average and so provide little signal on which to base classification decisions. At the same time, the vocabulary used in search queries is vast; classifiers based on word occurrence therefore have to deal with a very sparse feature space and often require large amounts of training data. Prior efforts to address feature sparseness augmented the feature space with features computed from the results obtained by issuing the query to be classified against a web search engine. However, these approaches induce high latency, making them unacceptable in practice. In this paper, we propose a new class of features that realizes the benefit of search-based features without the high latency. These features leverage co-occurrence between the query keywords and tags applied to documents in search results, resulting in a significant boost to web query classification accuracy. By precomputing the tag incidence for a suitably chosen set of keyword combinations, we are able to generate the features online with low latency and memory requirements. We evaluate the accuracy of our approach using a large corpus of real web queries in the context of commercial search.
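
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of precomputing tag incidence offline and looking it up at query time. The toy corpus, the function names (precompute_tag_incidence, query_features), the choice to index every keyword combination of up to two words, and the back-off from larger to smaller combinations are all assumptions made for illustration. The paper instead selects a "suitably chosen" set of combinations, and that selection is not reproduced here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: (words in the document, tags applied to it).
DOCS = [
    ({"cheap", "flights", "paris"}, {"travel"}),
    ({"paris", "hotel", "deals"}, {"travel"}),
    ({"paris", "hilton", "news"}, {"celebrity"}),
    ({"cheap", "laptop", "deals"}, {"shopping"}),
]


def precompute_tag_incidence(docs, max_size=2):
    """Offline step (assumed design): for each keyword combination of up to
    max_size words that co-occurs in a document, count that document's tags."""
    table = {}
    for words, tags in docs:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(words), size):
                table.setdefault(frozenset(combo), Counter()).update(tags)
    return table


def query_features(query, table, max_size=2):
    """Online step (assumed design): look up the precomputed combinations
    contained in the query and return normalized tag frequencies.
    Larger combinations are tried first; smaller ones are a fallback."""
    words = sorted(set(query.lower().split()))
    for size in range(min(max_size, len(words)), 0, -1):
        counts = Counter()
        for combo in combinations(words, size):
            counts.update(table.get(frozenset(combo), Counter()))
        total = sum(counts.values())
        if total:
            return {tag: n / total for tag, n in counts.items()}
    return {}  # no precomputed combination matches the query


TABLE = precompute_tag_incidence(DOCS)
FEATURES = query_features("cheap paris flights", TABLE)
print(FEATURES)
```

In this sketch the online step performs only dictionary lookups, which is the property the abstract attributes to precomputation: no search engine call is needed when a query arrives. A real system would also have to bound the size of the table, which is presumably what choosing the keyword combinations carefully achieves.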