Pervasive parallelism in data mining: dataflow solution to co-clustering large and sparse Netflix data

Authors:
Srivatsava Daruru;Nena M. Marin;Matt Walker;Joydeep Ghosh
Affiliations:
The University of Texas at Austin, Austin, TX, USA;Pervasive Software, Inc., Austin, TX, USA;Pervasive Software, Inc., Austin, TX, USA;The University of Texas at Austin, Austin, TX, USA
Venue:
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2009

Citing (8 references):
A divisive information theoretic feature clustering algorithm for text classification. The Journal of Machine Learning Research
Pattern Classification (2nd Edition)
A Scalable Collaborative Filtering Framework Based on Co-Clustering. ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Predictive discrete latent factor models for large scale dyadic data. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Modeling relationships at multiple scales to improve accuracy of large recommender systems. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
A framework for simultaneous co-clustering and learning from complex data. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation. The Journal of Machine Learning Research
Factorization meets the neighborhood: a multifaceted collaborative filtering model. Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Cited by (6):
HADI: Mining Radii of Large Graphs. ACM Transactions on Knowledge Discovery from Data (TKDD)
Garbage collection auto-tuning for Java mapreduce on multi-cores. Proceedings of the international symposium on Memory management
Approximate kernel k-means: solution to large scale kernel clustering. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Distributed scalable collaborative filtering algorithm. Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
A parallel matrix factorization based recommender by alternating stochastic gradient decent. Engineering Applications of Artificial Intelligence
Mining order-preserving submatrices from probabilistic matrices. ACM Transactions on Database Systems (TODS)

Abstract

All Netflix Prize algorithms proposed so far are prohibitively costly for large-scale production systems. In this paper, we describe an efficient dataflow implementation of a collaborative filtering (CF) solution to the Netflix Prize problem [1] based on weighted co-clustering [5]. The dataflow library we use facilitates the development of sophisticated parallel programs designed to fully utilize commodity multicore hardware, while hiding traditional difficulties such as queuing, threading, memory management, and deadlocks. The dataflow CF implementation first compresses the large, sparse training dataset into co-clusters. It then generates recommendations by combining the average ratings of the co-clusters with the biases of the users and movies. When configured to identify 20x20 co-clusters in the Netflix training dataset, the implementation predicted over 100 million ratings in 16.31 minutes and achieved an RMSE of 0.88846 without any fine-tuning or domain knowledge. This corresponds to an effective real-time prediction runtime of 9.7 µs per rating, which is far superior to previously reported results. Moreover, the implemented co-clustering framework supports a wide variety of other large-scale data mining applications and forms the basis for predictive modeling on large, dyadic datasets [4, 7].
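
The prediction step described in the abstract (co-cluster averages combined with user and movie biases) can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so this assumes the additive form often used in co-clustering-based collaborative filtering: co-cluster mean + (user mean - row-cluster mean) + (movie mean - column-cluster mean). The function names, the toy data, and the global-mean fallback for empty co-clusters are hypothetical, and the cluster assignments are taken as given rather than learned. It is not the paper's dataflow implementation.

```python
from collections import defaultdict


def _means(groups):
    """Average each list of ratings in a {key: [ratings]} mapping."""
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


def fit_summaries(ratings, row_cluster, col_cluster):
    """Compress (user, movie, rating) triples into per-group average ratings.

    row_cluster / col_cluster map user / movie ids to cluster indices and are
    assumed to come from a co-clustering step that is not shown here.
    """
    groups = {name: defaultdict(list)
              for name in ("user", "movie", "row", "col", "cocluster")}
    total = 0.0
    for user, movie, rating in ratings:
        g, h = row_cluster[user], col_cluster[movie]
        groups["user"][user].append(rating)
        groups["movie"][movie].append(rating)
        groups["row"][g].append(rating)
        groups["col"][h].append(rating)
        groups["cocluster"][(g, h)].append(rating)
        total += rating
    summaries = {name: _means(group) for name, group in groups.items()}
    summaries["global"] = total / len(ratings)
    return summaries


def predict(summaries, row_cluster, col_cluster, user, movie):
    """Co-cluster mean plus user and movie biases (assumed additive form)."""
    g, h = row_cluster[user], col_cluster[movie]
    # Assumption: an empty co-cluster falls back to the global mean rating.
    base = summaries["cocluster"].get((g, h), summaries["global"])
    user_bias = summaries["user"][user] - summaries["row"][g]
    movie_bias = summaries["movie"][movie] - summaries["col"][h]
    return base + user_bias + movie_bias


# Toy data (hypothetical): three users, three movies, two clusters per axis.
ratings = [("u1", "m1", 5), ("u1", "m2", 3), ("u2", "m1", 4),
           ("u2", "m2", 2), ("u3", "m3", 1)]
row_cluster = {"u1": 0, "u2": 0, "u3": 1}
col_cluster = {"m1": 0, "m2": 0, "m3": 1}

summaries = fit_summaries(ratings, row_cluster, col_cluster)
print(predict(summaries, row_cluster, col_cluster, "u1", "m1"))
```

Once the summaries are fitted, each prediction needs only a handful of dictionary lookups and additions, which is consistent with the very low per-rating prediction cost the abstract reports.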