Large-Scale Parallel Data Clustering
IEEE Transactions on Pattern Analysis and Machine Intelligence
Efficient clustering of high-dimensional data sets with application to reference matching
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Distributed data clustering can be efficient and exact
ACM SIGKDD Explorations Newsletter - Special issue on “Scalable data mining algorithms”
On feature distributional clustering for text categorization
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
MPI-The Complete Reference, Volume 1: The MPI Core
MPI-The Complete Reference, Volume 1: The MPI Core
Unsupervised document classification using sequential information maximization
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
A Fast Parallel Clustering Algorithm for Large Spatial Databases
Data Mining and Knowledge Discovery
Multivariate Information Bottleneck
UAI '01 Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence
The Journal of Machine Learning Research
Information-theoretic co-clustering
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
RCV1: A New Benchmark Collection for Text Categorization Research
The Journal of Machine Learning Research
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Multi-way distributional clustering via pairwise interactions
ICML '05 Proceedings of the 22nd international conference on Machine learning
Robust information-theoretic clustering
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
A scaleable document clustering approach for large document corpora
Information Processing and Management: an International Journal
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
A probabilistic framework for relational clustering
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Computational Statistics & Data Analysis
A rate-distortion one-class model and its applications to clustering
Proceedings of the 25th international conference on Machine learning
Topic and role discovery in social networks
IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence
Parallelization of a hierarchical data clustering algorithm using OpenMP
IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
Combinatorial markov random fields
ECML'06 Proceedings of the 17th European conference on Machine Learning
Parallel density-based clustering of complex objects
PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Improving clustering stability with combinatorial MRFs
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
An efficient clustering algorithm for large-scale topical web pages
Proceedings of the 18th ACM conference on Information and knowledge management
Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing
Hi-index | 0.00 |
The enormous amount and dimensionality of data processed by modern data mining tools require effective, scalable unsupervised learning techniques. Unfortunately, the majority of previously proposed clustering algorithms are either effective or scalable. This paper is concerned with information-theoretic clustering (ITC) that has historically been considered the state-of-the-art in clustering multi-dimensional data. Most existing ITC methods are computationally expensive and not easily scalable. Those few ITC methods that scale well (using, e.g., parallelization) are often outperformed by the others, of an inherently sequential nature. First, we justify this observation theoretically. We then propose data weaving - a novel method for parallelizing sequential clustering algorithms. Data weaving is intrinsically multi-modal - it allows simultaneous clustering of a few types of data (modalities). Finally, we use data weaving to parallelize multi-modal ITC, which results in proposing a powerful DataLoom algorithm. In our experimentation with small datasets, DataLoom shows practically identical performance compared to expensive sequential alternatives. On large datasets, however, DataLoom demonstrates significant gains over other parallel clustering methods. To illustrate the scalability, we simultaneously clustered rows and columns of a contingency table with over 120 billion entries.