We investigate the effect of feature weighting on document clustering, including a novel investigation of Okapi BM25 feature weighting. Using eight document datasets and 17 well-established clustering algorithms, we show that the benefit of tf-idf weighting over tf weighting depends heavily on both the dataset being clustered and the algorithm used. In addition, binary weighting is shown to be consistently inferior to both tf-idf weighting and tf weighting. We investigate clustering using both BM25 term saturation in isolation and BM25 term saturation with idf, confirming that both are superior to their non-BM25 counterparts (tf and tf-idf, respectively) under several common clustering quality measures. Finally, we investigate estimation of the k1 BM25 parameter when clustering. Our results indicate that typical values of k1 from other IR tasks are not appropriate for clustering; k1 needs to be higher.
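To make the two BM25 weighting variants concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the commonly used BM25 saturation form tf·(k1 + 1)/(tf + k1) with document length normalization omitted, and a plain log(N/df) idf; the abstract states neither formula, and the names `bm25_saturation` and `weight_documents` are hypothetical.

```python
import math
from collections import Counter


def bm25_saturation(tf, k1):
    """Saturating transform of a raw term frequency.

    Assumed form (not given in the abstract): tf * (k1 + 1) / (tf + k1).
    The value equals 1 at tf = 1 and approaches k1 + 1 as tf grows.
    """
    return tf * (k1 + 1.0) / (tf + k1)


def weight_documents(docs, k1=1.2, use_idf=True):
    """Return one {term: weight} dict per tokenized document.

    use_idf=False gives saturation in isolation; use_idf=True multiplies
    by an assumed idf of log(N / df), which is 0 for terms in every document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        vec = {}
        for term, tf in Counter(doc).items():
            weight = bm25_saturation(tf, k1)
            if use_idf:
                weight *= math.log(n_docs / df[term])
            vec[term] = weight
        vectors.append(vec)
    return vectors
```

Under this assumed form, a term occurring 10 times receives a weight of about 1.96 at k1 = 1.2 but 7.0 at k1 = 20, so a higher k1 preserves more of the raw term-frequency signal before the weight flattens out. This only illustrates what raising k1 does; it does not by itself explain why higher values suit clustering.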