The scalability of learning algorithms has always been a central concern for data mining researchers, and the rapid growth in data storage capacity and availability has only heightened its importance. To this end, several researchers have studied sampling as a means of deriving sufficiently accurate models from small fractions of the data. In this article we focus on spectral k-means, i.e., the k-means approximation derived via spectral relaxation, and propose a sequential sampling framework that iteratively enlarges the sample size until the k-means results (objective function and cluster structure) become indistinguishable from the asymptotic (infinite-data) output. The proposed framework adopts a principle commonly applied in data mining research: making minimal assumptions about the data-generating distribution. This restriction poses several challenges, mainly related to the efficiency of the sequential sampling procedure, which we address using elements of matrix perturbation theory and statistics. Moreover, although our main focus is spectral k-means, we also demonstrate that the framework generalizes to spectral clustering. We subsequently employ the sequential sampling framework to address the distributed clustering problem, where the task is to construct a global model for data residing in distributed network nodes. The main challenge in this context stems from the bandwidth constraints that are commonly imposed, which require the distributed clustering algorithm to consume a minimal amount of network load. This illustrates the applicability of the proposed approach, as it enables the determination of a minimal sample size for constructing an accurate clustering model that captures the distributional characteristics of the data.
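The sequential enlargement loop described above can be sketched as follows. The relaxed objective used here (trace of the Gram matrix minus the sum of its top-k eigenvalues) is one standard form of the spectral relaxation of k-means; the function names, growth factor, tolerance, and the simple relative-change stopping rule are illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

def spectral_kmeans_objective(X, k):
    # Spectral-relaxation lower bound on the k-means objective:
    # trace(X X^T) minus the sum of the top-k eigenvalues of the Gram
    # matrix.  The nonzero eigenvalues of X X^T coincide with those of
    # the much smaller d x d matrix X^T X, so we diagonalize the latter.
    S = X.T @ X
    eigvals = np.linalg.eigvalsh(S)          # ascending order
    return np.trace(S) - eigvals[-k:].sum()

def sequential_sample_size(data, k, start=100, growth=2.0, tol=0.05, seed=0):
    # Enlarge a uniform random sample until the per-point relaxed
    # objective changes by less than `tol` (relatively) between rounds.
    rng = np.random.default_rng(seed)
    n, prev = start, None
    while n <= len(data):
        idx = rng.choice(len(data), size=n, replace=False)
        obj = spectral_kmeans_objective(data[idx], k) / n
        if prev is not None and abs(obj - prev) <= tol * max(abs(prev), 1e-12):
            return n, obj
        prev = obj
        n = int(n * growth)
    return len(data), prev  # sample never stabilised: fall back to all data
```

In the distributed setting motivated above, the returned sample size bounds the number of points that must cross the network, which is what makes a minimal stabilising sample directly translate into minimal network load.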
Unlike the related distributed k-means approaches, our framework takes into account the fact that the choice of the number of clusters has a crucial effect on the required amount of communication. More precisely, the proposed algorithm derives a statistical estimate of the required relative sample size for every candidate value of k. This unique feature of our distributed clustering framework enables a network administrator to choose an economical solution that identifies the coarse cluster structure of a dataset, rather than devoting excessive network resources to identifying all the “correct” detailed clusters.
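One way such per-k size estimates could be tabulated is sketched below. The relaxed objective and the relative-change stopping rule are illustrative stand-ins for the paper's statistical estimator, and the function name and default parameters are assumptions:

```python
import numpy as np

def required_sample_sizes(data, k_max, start=100, growth=2.0, tol=0.05, seed=0):
    # For each k in 2..k_max, record the first sample size at which the
    # per-point spectrally relaxed k-means objective (trace of the scaled
    # Gram matrix minus its top-k eigenvalue sum) stabilises within `tol`.
    rng = np.random.default_rng(seed)
    needed, prev = {}, {}
    n = start
    while n <= len(data) and len(needed) < k_max - 1:
        sample = data[rng.choice(len(data), size=n, replace=False)]
        S = sample.T @ sample / n               # per-point Gram / second moment
        eigvals = np.linalg.eigvalsh(S)         # ascending order
        for k in range(2, k_max + 1):
            if k in needed:
                continue
            obj = np.trace(S) - eigvals[-k:].sum()
            if k in prev and abs(obj - prev[k]) <= tol * max(abs(prev[k]), 1e-12):
                needed[k] = n                   # objective stabilised for this k
            prev[k] = obj
        n = int(n * growth)
    for k in range(2, k_max + 1):
        needed.setdefault(k, len(data))         # never stabilised: use everything
    return needed
```

Given such a table, an administrator could pick the largest k whose estimated sample size, and hence communication cost, still fits the available bandwidth budget, trading cluster detail for network load exactly as described above.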