Approximate pairwise clustering for large data sets via sampling plus extension

Authors:
Liang Wang;Christopher Leckie;Ramamohanarao Kotagiri;James Bezdek
Affiliations:
National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China;Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria 3010, Australia;Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria 3010, Australia;Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria 3010, Australia
Venue:
Pattern Recognition
Year:
2011

Abstract

Pairwise clustering methods have shown great promise in many real-world applications, but their computational demands make them impractical for large data sets. This paper contributes a simple but efficient method, called eSPEC, that makes pairwise clustering feasible for such data. The method follows a "sampling, clustering plus extension" strategy. First, a small number of representative samples are selected from the relational pairwise data using a selective sampling scheme. Second, the chosen samples are grouped using a pairwise clustering algorithm combined with local scaling. Finally, labels are extended to the remaining instances by treating the assignment as a classification problem in a low-dimensional space, which is learned explicitly from the labeled samples using a cluster-preserving graph embedding technique. Extensive experiments on several synthetic and real-world data sets demonstrate that the method makes approximate clustering of large data sets feasible and that it accelerates clustering of loadable data sets.
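
The three-stage strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' eSPEC implementation, and every function name in it is hypothetical. It assumes maximin (farthest-point) selection as the selective sampling scheme, a basic normalized spectral clustering with a locally scaled Gaussian affinity as the pairwise clustering step, and a plain nearest-sample rule for the extension step. That last rule is a deliberately simpler stand-in for the cluster-preserving graph embedding classifier mentioned in the abstract.

```python
import numpy as np

def maximin_sample(D, m, start=0):
    """Assumed sampling scheme: farthest-point (maximin) selection of m indices."""
    idx = [start]
    min_d = D[start].copy()              # distance of each point to the nearest pick
    for _ in range(m - 1):
        nxt = int(np.argmax(min_d))      # the point farthest from all current picks
        idx.append(nxt)
        min_d = np.minimum(min_d, D[nxt])
    return np.array(idx)

def locally_scaled_affinity(Ds, k=3):
    """A_ij = exp(-d_ij^2 / (s_i * s_j)), with s_i the distance to the k-th nearest sample."""
    kth = min(k, Ds.shape[0] - 1)
    s = np.sort(Ds, axis=1)[:, kth]      # column 0 is the zero self-distance
    A = np.exp(-Ds ** 2 / (np.outer(s, s) + 1e-12))
    np.fill_diagonal(A, 0.0)
    return A

def spectral_labels(A, c, iters=20):
    """Basic normalized spectral clustering followed by a small k-means."""
    deg = A.sum(axis=1) + 1e-12
    L = A / np.sqrt(np.outer(deg, deg))  # D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    V = vecs[:, -c:]                     # eigenvectors of the c largest eigenvalues
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    pair = np.linalg.norm(V[:, None] - V[None], axis=2)
    centers = V[maximin_sample(pair, c)] # deterministic, well-spread initial centers
    for _ in range(iters):
        d = np.linalg.norm(V[:, None] - centers[None], axis=2)
        lab = np.argmin(d, axis=1)
        centers = np.array([V[lab == j].mean(axis=0) if np.any(lab == j)
                            else centers[j] for j in range(c)])
    return lab

def sample_cluster_extend(D, m, c, k=3):
    """Sample m points, cluster them into c groups, extend labels to all points."""
    idx = maximin_sample(D, m)
    A = locally_scaled_affinity(D[np.ix_(idx, idx)], k)
    sample_lab = spectral_labels(A, c)
    # Stand-in extension rule: copy the label of the nearest sampled point.
    nearest = np.argmin(D[:, idx], axis=1)
    return sample_lab[nearest], idx

# Toy usage: two well-separated Gaussian blobs of 100 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(10.0, 0.5, (100, 2))])
D = np.linalg.norm(X[:, None] - X[None], axis=2)
labels, idx = sample_cluster_extend(D, m=20, c=2)
```

For brevity the sketch builds the full distance matrix `D`, but for symmetric `D` it only ever reads the rows and columns belonging to the `m` sampled points. The expensive spectral step therefore runs on an `m x m` matrix rather than an `n x n` one, which is the saving the sampling-plus-extension strategy aims for.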