CRD: fast co-clustering on large datasets utilizing sampling-based matrix decomposition

Authors:
Feng Pan;Xiang Zhang;Wei Wang
Affiliations:
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA;University of North Carolina at Chapel Hill, Chapel Hill, NC, USA;University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Venue:
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Year:
2008

Citing 13
Cited 5

Document clustering using word clusters via the information bottleneck method

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Competitive recommendation systems

STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
Biclustering of Expression Data

Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology
Document clustering based on non-negative matrix factorization

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Information-theoretic co-clustering

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Fully automatic cross-associations

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
A generalized maximum entropy approach to bregman co-clustering and matrix approximation

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
A general model for clustering binary data

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Co-clustering by block value decomposition

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Mining Shifting-and-Scaling Co-Regulation Patterns on Gene Expression Profiles

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition

SIAM Journal on Computing
Orthogonal nonnegative matrix t-factorizations for clustering

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

Effective data co-reduction for multimedia similarity search

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Randomized Algorithms for Matrices and Data

Foundations and Trends® in Machine Learning
Low-Rank matrix factorization and co-clustering algorithms for analyzing large data sets

ICDEM'10 Proceedings of the Second international conference on Data Engineering and Management
Can we analyze big data inside a DBMS?

Proceedings of the sixteenth international workshop on Data warehousing and OLAP
Mining order-preserving submatrices from probabilistic matrices

ACM Transactions on Database Systems (TODS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

The problem of simultaneously clustering columns and rows (co-clustering) arises in important applications, such as text data mining, microarray analysis, and recommendation system analysis. Compared with the classical clustering algorithms, co-clustering algorithms have been shown to be more effective in discovering hidden clustering structures in the data matrix. The complexity of previous co-clustering algorithms is usually O(m X n), where m and n are the numbers of rows and columns in the data matrix respectively. This limits their applicability to data matrices involving a large number of columns and rows. Moreover, some huge datasets can not be entirely held in main memory during co-clustering which violates the assumption made by the previous algorithms. In this paper, we propose a general framework for fast co-clustering large datasets, CRD. By utilizing recently developed sampling-based matrix decomposition methods, CRD achieves an execution time linear in m and n. Also, CRD does not require the whole data matrix be in the main memory. We conducted extensive experiments on both real and synthetic data. Compared with previous co-clustering algorithms, CRD achieves competitive accuracy but with much less computational cost.