Clustering web queries

Authors:
John S. Whissell; Charles L.A. Clarke; Azin Ashkan
Affiliations:
University of Waterloo, Waterloo, ON, Canada; University of Waterloo, Waterloo, ON, Canada; University of Waterloo, Waterloo, ON, Canada
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

Despite the wide applicability of clustering methods, their evaluation remains a problem. In this paper, we present a metric for the evaluation of clustering methods. The data set to be clustered is viewed as a sample from a larger population, with clustering quality measured in terms of our predicted ability to discriminate between members of this population. We measure this property by training a classifier to recognize each cluster and measuring the accuracy of this classifier, normalized by a notion of expected accuracy. To demonstrate the applicability of this metric, we apply it to Web queries. We investigate a commercially oriented data set of 1700 queries and a general data set of 4000 queries. Both sets are taken from the logs of a commercial Web search engine. Clustering is based on the contents of the search engine result pages generated by executing the queries on the search engine from which they were taken. Multiple clustering algorithms are crossed with various weighting schemes to produce multiple clusterings of each query set. Our metric is used to evaluate these clusterings. The results on the commercially oriented data set are compared to two pre-existing manual labelings, and are also used in an ad clickthrough experiment.
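
The idea of scoring a clustering by how well a classifier can recognize its clusters can be illustrated with a minimal sketch. The abstract does not specify the classifier, the evaluation protocol, or the exact form of the normalization, so everything concrete below is an assumption made for illustration: a nearest-centroid classifier, k-fold cross-validation, and an expected accuracy taken to be the chance agreement of guessing labels in proportion to cluster sizes. The names `fit_centroids`, `predict`, and `cluster_quality` are hypothetical.

```python
import random
from collections import Counter


def fit_centroids(xs, ys):
    """Return the mean feature vector of each cluster label."""
    sums, counts = {}, Counter()
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] += 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Assign x to the label whose centroid is closest (squared Euclidean)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)),
    )


def cluster_quality(points, labels, folds=5, seed=0):
    """Cross-validated classifier accuracy, normalized by expected accuracy.

    Assumption (not from the paper): expected accuracy is sum(p_k ** 2),
    where p_k is the fraction of items in cluster k.
    """
    if folds < 2:
        raise ValueError("folds must be at least 2")
    n = len(points)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)

    correct = 0
    for f in range(folds):
        test = idx[f::folds]          # every folds-th shuffled index
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        centroids = fit_centroids(
            [points[i] for i in train], [labels[i] for i in train]
        )
        correct += sum(predict(centroids, points[i]) == labels[i] for i in test)

    accuracy = correct / n
    expected = sum((c / n) ** 2 for c in Counter(labels).values())
    if expected == 1.0:               # a single cluster carries no information
        return 0.0
    return (accuracy - expected) / (1.0 - expected)


# Toy data: two well-separated groups, labeled by group.
points = [(0.1 * i, 0.0) for i in range(10)] + [(10 + 0.1 * i, 10.0) for i in range(10)]
labels = [0] * 10 + [1] * 10
score = cluster_quality(points, labels)
```

Under these assumptions a score near 1 means the clusters are easy for the classifier to tell apart on held-out items, while a score near 0 means the classifier does no better than the chance baseline.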