Efficient prediction-based validation for document clustering

Authors:
Derek Greene; Pádraig Cunningham
Affiliations:
Trinity College, University of Dublin, Dublin 2, Ireland;Trinity College, University of Dublin, Dublin 2, Ireland
Venue:
ECML '06: Proceedings of the 17th European Conference on Machine Learning
Year:
2006

Abstract

Recently, stability-based techniques have emerged as a very promising solution to the problem of cluster validation. An inherent drawback of these approaches is the computational cost of generating and assessing multiple clusterings of the data. In this paper we present an efficient prediction-based validation approach suitable for application to large, high-dimensional datasets such as text corpora. We use kernel clustering to isolate the validation procedure from the original data. Furthermore, we employ a prototype reduction strategy that allows us to work on a reduced kernel matrix, leading to significant computational savings. To ensure that this condensed representation accurately reflects the cluster structures in the data, we propose a density-biased strategy to select the reduced prototypes. This novel validation process is evaluated on real-world text datasets, where it is shown to consistently produce good estimates for the optimal number of clusters.
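
The abstract names the ingredients of the approach (a kernel matrix, a reduced set of prototypes chosen with a density bias, and prediction-based validation over candidate numbers of clusters) but gives no algorithmic detail. The following Python sketch is only an illustration of how those pieces could fit together and is not the authors' implementation. Everything specific in it is an assumption: the function names (`select_prototypes`, `kernel_kmeans`, `predict`, `pair_agreement`, `prediction_score`), the density proxy (mean similarity to the nearest neighbours), the direction of the sampling bias (towards dense regions), the use of plain kernel k-means, the pairwise agreement score, and the synthetic data standing in for a text corpus.

```python
import numpy as np

def select_prototypes(K, m, rng, n_neighbors=10):
    """Sample m prototype indices, biased towards dense regions (assumed bias)."""
    S = K.copy()
    np.fill_diagonal(S, -np.inf)                  # ignore self-similarity
    top = np.sort(S, axis=1)[:, -n_neighbors:]    # most similar other points
    density = top.mean(axis=1)                    # hypothetical density proxy
    weights = density - density.min() + 1e-12     # shift to strictly positive
    return rng.choice(K.shape[0], size=m, replace=False, p=weights / weights.sum())

def kernel_kmeans(K, k, rng, n_iter=50):
    """Plain kernel k-means on a precomputed kernel matrix, random initialisation."""
    n = K.shape[0]
    labels = rng.integers(k, size=n)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)            # empty clusters stay at infinity
        for c in range(k):
            members = labels == c
            if members.any():
                # ||phi(x) - mu_c||^2 = K_xx - 2 mean_j K_xj + mean_jl K_jl
                dist[:, c] = (np.diag(K) - 2.0 * K[:, members].mean(axis=1)
                              + K[np.ix_(members, members)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def predict(K_test_train, K_train, train_labels, k):
    """Assign each test point to the nearest training centroid in kernel space."""
    dist = np.full((K_test_train.shape[0], k), np.inf)
    for c in range(k):
        members = train_labels == c
        if members.any():
            # The K_xx term is constant per test point, so it is omitted.
            dist[:, c] = (-2.0 * K_test_train[:, members].mean(axis=1)
                          + K_train[np.ix_(members, members)].mean())
    return dist.argmin(axis=1)

def pair_agreement(a, b):
    """Fraction of point pairs on which two labelings agree (Rand index)."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)
    return float((same_a[iu] == same_b[iu]).mean())

def prediction_score(K, k, rng, n_splits=10):
    """Average agreement between predicted and clustered labels over random splits."""
    n = K.shape[0]
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        tr, te = perm[: n // 2], perm[n // 2:]
        K_tr, K_te = K[np.ix_(tr, tr)], K[np.ix_(te, te)]
        tr_labels = kernel_kmeans(K_tr, k, rng)
        te_labels = kernel_kmeans(K_te, k, rng)
        predicted = predict(K[np.ix_(te, tr)], K_tr, tr_labels, k)
        scores.append(pair_agreement(te_labels, predicted))
    return float(np.mean(scores))

# Synthetic stand-in for a document collection: three groups of unit-length vectors.
rng = np.random.default_rng(0)
centers = np.eye(3, 20) * 5.0
X = np.vstack([c + rng.normal(size=(100, 20)) for c in centers])
X /= np.linalg.norm(X, axis=1, keepdims=True)
K = X @ X.T                                       # cosine (linear) kernel, 300 x 300

idx = select_prototypes(K, m=90, rng=rng)         # reduced prototype set
K_red = K[np.ix_(idx, idx)]                       # reduced kernel matrix, 90 x 90
scores = {k: prediction_score(K_red, k, rng) for k in range(2, 6)}
print(scores)                                     # higher score = more stable choice of k
```

In this sketch all clustering and validation runs touch only the 90 x 90 reduced kernel matrix rather than the full 300 x 300 one, which is the kind of saving the abstract attributes to prototype reduction. The candidate number of clusters with the highest average agreement would be taken as the estimate.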