Clustering Large Graphs via the Singular Value Decomposition

  • Authors:
  • P. Drineas; A. Frieze; R. Kannan; S. Vempala; V. Vinay

  • Affiliations:
  • Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180. drinep@cs.rpi.edu
  • Department of Mathematical Sciences, Carnegie Mellon University, Pittsburgh, PA 15213. alan@random.math.cmu.edu
  • Computer Science Department, Yale University, New Haven, CT 06520. kannan@cs.yale.edu
  • Department of Mathematics, M.I.T., Cambridge, MA 02139. vempala@math.mit.edu
  • Indian Institute of Science, Bangalore, India. vinay@csa.iisc.ernet.in

  • Venue:
  • Machine Learning
  • Year:
  • 2004

Abstract

We consider the problem of partitioning a set of m points in n-dimensional Euclidean space into k clusters (usually m and n are variable, while k is fixed), so as to minimize the sum of squared distances between each point and its cluster center. This is commonly the objective minimized by the k-means clustering algorithm (Kanungo et al., 2000). We prove that this problem is NP-hard even for k = 2, and we consider a continuous relaxation of this discrete problem: find the k-dimensional subspace V that minimizes the sum of squared distances to V of the m points. This relaxation can be solved by computing the Singular Value Decomposition (SVD) of the m × n matrix A that represents the m points; this solution can be used to obtain a 2-approximation algorithm for the original problem. We then argue that the relaxation in fact provides a generalized clustering which is useful in its own right.

Finally, we show that the SVD of a random submatrix—chosen according to a suitable probability distribution—of a given matrix provides an approximation to the SVD of the whole matrix, thus yielding a very fast randomized algorithm. We expect this algorithm to be the main contribution of this paper, since it can be applied to problems of very large size that typically arise in modern applications.
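The following Python/numpy snippet is a minimal sketch, not the authors' implementation, of the two ideas outlined in the abstract: the SVD-based continuous relaxation (the optimal k-dimensional subspace is spanned by the top-k right singular vectors, and clustering the projected points underlies the 2-approximation), and a fast approximate version that works with a small random sample of the rows. The function names, the sample size s, and the use of squared-row-norm ("length-squared") sampling as the suitable probability distribution are assumptions made for illustration.

```python
# Sketch only: illustrates the SVD relaxation and row sampling from the abstract.
import numpy as np

def best_rank_k_subspace(A, k):
    """Continuous relaxation: the k-dimensional subspace minimizing the sum of
    squared distances of the rows of A is spanned by the top-k right singular
    vectors of A."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k]                      # shape (k, n)

def project_rows(A, Vk):
    """Project every point (row of A) onto the subspace spanned by Vk."""
    return A @ Vk.T @ Vk               # projected points, still in R^n

def length_squared_row_sample(A, s, seed=None):
    """Sample s rows of A with probability proportional to their squared
    Euclidean norm (an assumed choice for the 'suitable probability
    distribution'), rescaled so that E[S^T S] = A^T A."""
    rng = np.random.default_rng(seed)
    p = (A ** 2).sum(axis=1)
    p = p / p.sum()
    idx = rng.choice(A.shape[0], size=s, replace=True, p=p)
    return A[idx] / np.sqrt(s * p[idx, None])

# Usage sketch:
#   Vk = best_rank_k_subspace(A, k)            # exact relaxation via full SVD
#   S  = length_squared_row_sample(A, s=200)   # small sampled submatrix
#   Vk_approx = best_rank_k_subspace(S, k)     # cheap SVD of the sample
#   P  = project_rows(A, Vk_approx)            # then cluster the projected points
```

The projected points live in a k-dimensional subspace, so any clustering step applied to them (e.g. k-means on P) operates on much lower-dimensional data; the sampled version replaces the expensive SVD of A with the SVD of a small submatrix.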