The scalability of learning algorithms has always been a central concern for data mining researchers, and the rapid growth in data storage capacity and availability has only heightened its importance. To this end, several researchers have studied sampling as a means of deriving sufficiently accurate models from small fractions of the data. In this article we focus on spectral k-means, i.e., the k-means approximation derived via spectral relaxation, and propose a sequential sampling framework that iteratively enlarges the sample size until the k-means results (objective function and cluster structure) become indistinguishable from the asymptotic (infinite-data) output. The proposed framework adopts a principle commonly applied in data mining research: making minimal assumptions about the data-generating distribution. This restriction poses several challenges, mainly related to the efficiency of the sequential sampling procedure, which we address using elements of matrix perturbation theory and statistics. Moreover, although our main focus is spectral k-means, we also demonstrate that the framework generalizes to spectral clustering. We subsequently employ the sequential sampling framework to address the distributed clustering problem, where the task is to construct a global model for data residing in distributed network nodes. The main challenge in this context stems from the bandwidth constraints that are commonly imposed, which require the distributed clustering algorithm to consume a minimal amount of network load. This illustrates the applicability of the proposed approach, as it enables the determination of a minimal sample size for constructing an accurate clustering model that captures the distributional characteristics of the data.
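The sequential enlargement loop described above can be sketched as follows. The relaxed objective used here (trace of the Gram matrix minus the sum of its top-k eigenvalues) is one standard form of the spectral relaxation of k-means; the function names, growth factor, tolerance, and the simple relative-change stopping rule are illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

def spectral_kmeans_objective(X, k):
    # Spectral-relaxation lower bound on the k-means objective:
    # trace(X X^T) minus the sum of the top-k eigenvalues of the Gram
    # matrix.  The nonzero eigenvalues of X X^T coincide with those of
    # the much smaller d x d matrix X^T X, so we diagonalize the latter.
    S = X.T @ X
    eigvals = np.linalg.eigvalsh(S)          # ascending order
    return np.trace(S) - eigvals[-k:].sum()

def sequential_sample_size(data, k, start=100, growth=2.0, tol=0.05, seed=0):
    # Enlarge a uniform random sample until the per-point relaxed
    # objective changes by less than `tol` (relatively) between rounds.
    rng = np.random.default_rng(seed)
    n, prev = start, None
    while n <= len(data):
        idx = rng.choice(len(data), size=n, replace=False)
        obj = spectral_kmeans_objective(data[idx], k) / n
        if prev is not None and abs(obj - prev) <= tol * max(abs(prev), 1e-12):
            return n, obj
        prev = obj
        n = int(n * growth)
    return len(data), prev  # sample never stabilised: fall back to all data
```

In the distributed setting motivated above, the returned sample size bounds the number of points that must cross the network, which is what makes a minimal stabilising sample directly translate into minimal network load.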
Unlike the related distributed k-means approaches, our framework takes into account the fact that the choice of the number of clusters has a crucial effect on the required amount of communication. More precisely, the proposed algorithm derives a statistical estimate of the required relative sample size for every candidate value of k. This unique feature of our distributed clustering framework enables a network administrator to choose an economical solution that identifies the coarse cluster structure of a dataset, rather than devoting excessive network resources to identifying all the “correct” detailed clusters.
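One way such per-k size estimates could be tabulated is sketched below. The relaxed objective and the relative-change stopping rule are illustrative stand-ins for the paper's statistical estimator, and the function name and default parameters are assumptions:

```python
import numpy as np

def required_sample_sizes(data, k_max, start=100, growth=2.0, tol=0.05, seed=0):
    # For each k in 2..k_max, record the first sample size at which the
    # per-point spectrally relaxed k-means objective (trace of the scaled
    # Gram matrix minus its top-k eigenvalue sum) stabilises within `tol`.
    rng = np.random.default_rng(seed)
    needed, prev = {}, {}
    n = start
    while n <= len(data) and len(needed) < k_max - 1:
        sample = data[rng.choice(len(data), size=n, replace=False)]
        S = sample.T @ sample / n               # per-point Gram / second moment
        eigvals = np.linalg.eigvalsh(S)         # ascending order
        for k in range(2, k_max + 1):
            if k in needed:
                continue
            obj = np.trace(S) - eigvals[-k:].sum()
            if k in prev and abs(obj - prev[k]) <= tol * max(abs(prev[k]), 1e-12):
                needed[k] = n                   # objective stabilised for this k
            prev[k] = obj
        n = int(n * growth)
    for k in range(2, k_max + 1):
        needed.setdefault(k, len(data))         # never stabilised: use everything
    return needed
```

Given such a table, an administrator could pick the largest k whose estimated sample size, and hence communication cost, still fits the available bandwidth budget, trading cluster detail for network load exactly as described above.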