Consider an M×N matrix whose (i,j)-th entry represents the affinity between the i-th entity of the first type and the j-th entity of the second type. Co-clustering clusters both types of entities simultaneously, using the affinities to guide the clustering. It has been found to achieve clustering and dimensionality reduction at the same time, and it is therefore finding application in a variety of problems. The recently proposed Bregman co-clustering algorithm converts the co-clustering task into a search for an optimal approximation matrix. It is much more scalable, but memory-based implementations have a severe computational bottleneck. In this paper we show that a significant fraction of the computations performed by the Bregman co-clustering algorithm map naturally to those performed by an on-line analytical processing (OLAP) engine, which makes an OLAP engine a well-suited data management engine for the algorithm. Based on this observation, we have developed a version of the Bregman co-clustering algorithm that works on top of OLAP. Our experiments show that this version is much more scalable, achieving an order-of-magnitude performance improvement over the memory-based implementation. We believe this unlocks the power of this novel technique for application to much larger datasets.
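To illustrate why the algorithm's aggregation work resembles an OLAP query, the following minimal sketch computes the per-block summary that a GROUP BY over (row cluster, column cluster) would produce, and then builds the approximation matrix from it. It is not the paper's implementation. It assumes the simplest case, in which each entry is approximated by the mean of its co-cluster, and it takes the cluster assignments as given rather than searching for them. The names `block_means` and `approximate` are hypothetical.

```python
import numpy as np


def block_means(Z, row_labels, col_labels, k, l):
    """Mean affinity of every (row cluster, column cluster) block.

    Equivalent in spirit to: SELECT AVG(z) ... GROUP BY row_cluster, col_cluster.
    """
    sums = np.zeros((k, l))
    counts = np.zeros((k, l))
    # Broadcast the labels so each matrix entry maps to exactly one block.
    blocks = (row_labels[:, None], col_labels[None, :])
    np.add.at(sums, blocks, Z)    # unbuffered add: repeated indices accumulate
    np.add.at(counts, blocks, 1)
    return sums / np.maximum(counts, 1)  # empty blocks stay at 0


def approximate(Z, row_labels, col_labels, k, l):
    """Approximation matrix: every entry replaced by its block mean."""
    means = block_means(Z, row_labels, col_labels, k, l)
    return means[row_labels[:, None], col_labels[None, :]]


# Illustrative data only: 3 entities of each type, 2 clusters on each side.
Z = np.array([[1.0, 3.0, 10.0],
              [1.0, 3.0, 10.0],
              [7.0, 9.0, 20.0]])
row_labels = np.array([0, 0, 1])
col_labels = np.array([0, 0, 1])

means = block_means(Z, row_labels, col_labels, k=2, l=2)
Z_hat = approximate(Z, row_labels, col_labels, k=2, l=2)
```

A full co-clustering run would alternate between recomputing such summaries and reassigning rows and columns to clusters. The summaries are the part that corresponds to the grouped aggregations an OLAP engine is built to compute.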