The optimum clustering framework: implementing the cluster hypothesis

Authors:
Norbert Fuhr;Marc Lechtenfeld;Benno Stein;Tim Gollub
Affiliations:
University of Duisburg-Essen, Duisburg, Germany;University of Duisburg-Essen, Duisburg, Germany;Bauhaus-Universität Weimar, Weimar, Germany;Bauhaus-Universität Weimar, Weimar, Germany
Venue:
Information Retrieval
Year:
2012

Abstract

Document clustering offers the potential of supporting users in interactive retrieval, especially when users have problems specifying their information need precisely. In this paper, we present a theoretical foundation for optimum document clustering. The key idea is to base cluster analysis and evaluation on a set of queries, by defining documents as similar if they are relevant to the same queries. Three components are essential within our optimum clustering framework, OCF: (1) a set of queries, (2) a probabilistic retrieval method, and (3) a document similarity metric. After introducing an appropriate validity measure, we define optimum clustering with respect to the estimates of the relevance probability for the query-document pairs under consideration. Moreover, we show that well-known clustering methods are implicitly based on these three components, but that they rely on heuristic design decisions for some of them. We argue that our framework makes more targeted research into better document clustering methods possible. Experimental results demonstrate the potential of our considerations.
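The query-based notion of similarity can be illustrated with a minimal sketch. It assumes that a probabilistic retrieval method has already produced an estimated relevance probability for every query-document pair, and it uses the scalar product of two documents' probability vectors as the similarity. That choice, the function name `query_based_similarity`, and the numbers are assumptions made for illustration. They are not taken from the abstract.

```python
from typing import List


def query_based_similarity(rel_prob: List[List[float]]) -> List[List[float]]:
    """Return a document-by-document similarity matrix.

    rel_prob[i][j] is the (assumed, already estimated) probability that
    document i is relevant to query j. The similarity of two documents is
    the scalar product of their rows, which is large when both documents
    are likely to be relevant to the same queries.
    """
    n = len(rel_prob)
    return [
        [sum(p * q for p, q in zip(rel_prob[i], rel_prob[k])) for k in range(n)]
        for i in range(n)
    ]


# Hypothetical estimates for 4 documents (rows) and 3 queries (columns).
rel_prob = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.7],
    [0.1, 0.8, 0.9],
]

sim = query_based_similarity(rel_prob)

# Documents 0 and 1 are likely relevant to the same query (query 0),
# so they score higher with each other than document 0 does with document 2.
print(sim[0][1] > sim[0][2])
```

A clustering method could then group documents using `sim` in place of a term-based similarity. How the groups are formed and validated is left open here.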