Effective measures for inter-document similarity

Authors:
John S. Whissell;Charles L.A. Clarke
Affiliations:
University of Waterloo, Waterloo, ON, Canada;University of Waterloo, Waterloo, ON, Canada
Venue:
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Year:
2013

Citing 16
Cited 0

Document Categorization and Query Generation on the World Wide WebUsing WebACE

Artificial Intelligence Review - Special issue on data mining on the Internet
Document clustering using word clusters via the information bottleneck method

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Meta-scoring: automatically evaluating term weighting schemes in IR without precision-recall

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Probabilistic models of information retrieval based on measuring the divergence from randomness

ACM Transactions on Information Systems (TOIS)
Optimizing search engines using clickthrough data

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Frequent term-based text clustering

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Document clustering based on non-negative matrix factorization

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
A study of smoothing methods for language models applied to information retrieval

ACM Transactions on Information Systems (TOIS)
Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering

Machine Learning
Learning to rank using gradient descent

ICML '05 Proceedings of the 22nd international conference on Machine learning
Feature diversity in cluster ensembles for robust document clustering

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Information theoretic measures for clusterings comparison: is a correction for chance necessary?

ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
Exploiting Wikipedia as external knowledge for document clustering

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
A machine learning approach for improved BM25 retrieval

Proceedings of the 18th ACM conference on Information and knowledge management
Information Retrieval: Implementing and Evaluating Search Engines

Information Retrieval: Implementing and Evaluating Search Engines
Improving document clustering using Okapi BM25 feature weighting

Information Retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

While supervised learning-to-rank algorithms have largely supplanted unsupervised query-document similarity measures for search, the exploration of query-document measures by many researchers over many years produced insights that might be exploited in other domains. For example, the BM25 measure substantially and consistently outperforms cosine across many tested environments, and potentially provides retrieval effectiveness approaching that of the best learning-to-rank methods over equivalent features sets. Other measures based on language modeling and divergence from randomness can outperform BM25 in some circumstances. Despite this evidence, cosine remains the prevalent method for determining inter-document similarity for clustering and other applications. However, recent research demonstrates that BM25 terms weights can significantly improve clustering. In this work, we extend that result, presenting and evaluating novel inter-document similarity measures based on BM25, language modeling, and divergence from randomness. In our first experiment we analyze the accuracy of nearest neighborhoods when using our measures. In our second experiment, we analyze using clustering algorithms in conjunction with our measures. Our novel symmetric BM25 and language modeling similarity measures outperform alternative measures in both experiments. This outcome strongly recommends the adoption of these measures, replacing cosine similarity in future work.