Parallel Spectral Clustering in Distributed Systems

Authors:
Wen-Yen Chen;Yangqiu Song;Hongjie Bai;Chih-Jen Lin;Edward Y. Chang
Affiliations:
Yahoo! Inc., Sunnyvale;Microsoft Research Asia, Beijing;Google Information Technology (China) Co., Ltd., Beijing;National Taiwan University, Taipei;Google Research, Palo Alto
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
2011

Citing 0
Cited 29

Long distance bigram models applied to word clustering
Pattern Recognition

A Very Fast Method for Clustering Big Text Datasets
Proceedings of the 2010 conference on ECAI 2010: 19th European Conference on Artificial Intelligence

Clustered Nyström method for large scale manifold learning and dimension reduction
IEEE Transactions on Neural Networks

On a strategy for spectral clustering with parallel computation
VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science

Large-scale cross-document coreference using distributed inference and hierarchical models
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1

A new anticorrelation-based spectral clustering formulation
ACIVS'11 Proceedings of the 13th international conference on Advanced concepts for intelligent vision systems

Leveraging social media networks for classification
Data Mining and Knowledge Discovery

SBV-Cut: Vertex-cut based graph partitioning using structural balance vertices
Data & Knowledge Engineering

Vector quantization based approximate spectral clustering of large datasets
Pattern Recognition

A conversation with Dr. Edward Y. Chang
ACM SIGKDD Explorations Newsletter

Fast nonnegative matrix tri-factorization for large-scale data co-clustering
IJCAI'11 Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume Two

Distributed approximate spectral clustering for large-scale datasets
Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing

Automatic taxonomy construction from keywords
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining

A comparative study of efficient initialization methods for the k-means clustering algorithm
Expert Systems with Applications: An International Journal

Maximum margin clustering on evolutionary data
Proceedings of the 21st ACM international conference on Information and knowledge management

Constraint projections for semi-supervised affinity propagation
Knowledge-Based Systems

Relational co-clustering via manifold ensemble learning
Proceedings of the 21st ACM international conference on Information and knowledge management

ClusterFA: a memory-efficient DFA structure for network intrusion detection
Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security

p-PIC: Parallel power iteration clustering for big data
Journal of Parallel and Distributed Computing

MicroClAn: Microarray clustering analysis
Journal of Parallel and Distributed Computing

Interpreting pedestrian behaviour by visualising and clustering movement data
W2GIS'13 Proceedings of the 12th international conference on Web and Wireless Geographical Information Systems

Fast global k-means clustering based on local geometrical information
Information Sciences: an International Journal

Biomedical time series clustering based on non-negative sparse coding and probabilistic topic model
Computer Methods and Programs in Biomedicine

Locally discriminative spectral clustering with composite manifold
Neurocomputing

Robust tensor clustering with non-greedy maximization
IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence

Large-scale spectral clustering on graphs
IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence

Discriminative Orthogonal Nonnegative matrix factorization with flexibility for data representation
Expert Systems with Applications: An International Journal

Combining supervised and unsupervised models via unconstrained probabilistic embedding
Information Sciences: an International Journal

Local information-based fast approximate spectral clustering
Pattern Recognition Letters

Abstract

Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare sparsifying the matrix with approximating it by the Nyström method. We then pick the strategy of sparsifying the matrix by retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863 instances, we show that our parallel algorithm can effectively handle large problems.
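
To make the sparsification idea in the abstract concrete, the following is a minimal single-machine sketch, not the authors' parallel implementation. It keeps, for each point, only the similarities to its t nearest neighbors and then symmetrizes the result. The function name `sparse_knn_similarity`, the Gaussian similarity with width `sigma`, the "keep an entry if either point is a neighbor of the other" symmetrization rule, and the dictionary-of-rows storage are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def sparse_knn_similarity(points, t, sigma=1.0):
    """Return a symmetric sparse similarity matrix as {i: {j: s_ij}}.

    Hypothetical helper: Gaussian similarity and the symmetrization
    rule below are illustrative assumptions, not taken from the paper.
    """
    n = len(points)
    rows = {i: {} for i in range(n)}
    for i in range(n):
        # Squared Euclidean distance from point i to every other point.
        dists = []
        for j in range(n):
            if j != i:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                dists.append((d2, j))
        dists.sort()
        # Retain only the t nearest neighbors of point i.
        for d2, j in dists[:t]:
            rows[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    # Symmetrize: keep s_ij if j is a neighbor of i or i is a neighbor of j.
    for i in range(n):
        for j, s in list(rows[i].items()):
            rows[j].setdefault(i, s)
    return rows


# Two small, well-separated groups of 2-D points (toy data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
S = sparse_knn_similarity(points, t=2)
stored = sum(len(row) for row in S.values())
print(stored, "stored entries instead of", len(points) ** 2)
```

With t fixed, the number of stored entries grows roughly linearly in the number of points rather than quadratically, which is the memory saving that motivates sparsification. The nested loop here still computes all pairwise distances on one machine; distributing that computation and the later eigendecomposition is the subject of the paper and is not shown.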