Improving document clustering using Okapi BM25 feature weighting

Authors:
John S. Whissell;Charles L. Clarke
Affiliations:
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada N2L 3G1;David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada N2L 3G1
Venue:
Information Retrieval
Year:
2011

Abstract

We investigate the effect of feature weighting on document clustering, including a novel investigation of Okapi BM25 feature weighting. Using eight document datasets and 17 well-established clustering algorithms, we show that the benefit of tf-idf weighting over tf weighting depends heavily on both the dataset being clustered and the algorithm used. In addition, binary weighting is shown to be consistently inferior to both tf-idf weighting and tf weighting. We investigate clustering using both BM25 term saturation in isolation and BM25 term saturation with idf, confirming that both are superior to their non-BM25 counterparts under several common clustering quality measures. Finally, we investigate estimation of the BM25 parameter k1 when clustering. Our results indicate that typical values of k1 from other IR tasks are not appropriate for clustering; k1 needs to be higher.
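To make the weighting schemes concrete, the following is a minimal Python sketch of BM25 term saturation, with and without idf, applied as a document feature weight. It is an illustration only and not the authors' implementation: the function name `bm25_weights`, the length-normalization parameter `b` with its default of 0.75, and the plain `log(N / df)` form of idf are assumptions taken from the commonly used textbook formulation of BM25, since the abstract does not give the exact formulas used in the paper.

```python
import math
from collections import Counter


def bm25_weights(docs, k1, b=0.75, use_idf=True):
    """Return one {term: weight} dict per tokenized, non-empty document.

    Hypothetical helper for illustration; the formula is the common
    textbook BM25 term component, assumed rather than taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    weighted = []
    for d in docs:
        # Length-normalized saturation constant for this document.
        norm = k1 * (1.0 - b + b * len(d) / avg_len)
        vec = {}
        for term, tf in Counter(d).items():
            # Saturating tf component: bounded above by k1 + 1.
            w = tf * (k1 + 1.0) / (tf + norm)
            if use_idf:
                # Assumed idf form: log(N / df); zero for terms in every document.
                w *= math.log(n_docs / df[term])
            vec[term] = w
        weighted.append(vec)
    return weighted


docs = [["a", "a", "a", "b"], ["b", "c"]]
low = bm25_weights(docs, k1=1.2, use_idf=False)    # a value in the range often used for retrieval
high = bm25_weights(docs, k1=20.0, use_idf=False)  # arbitrary larger value, not from the paper
ratio_low = low[0]["a"] / low[0]["b"]
ratio_high = high[0]["a"] / high[0]["b"]
```

In this sketch, a larger k1 makes the weight grow more nearly linearly with term frequency, so `ratio_high` is closer to the raw frequency ratio of 3 than `ratio_low` is. The value 20.0 is chosen only to show the direction of the effect; the abstract states that k1 should be higher for clustering than in other IR tasks but does not give a specific value.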