Low-rank matrix factorization has recently been applied with great success to matrix completion problems in applications such as recommendation systems, link prediction for social networks, and click prediction for web search. However, as this approach is applied to increasingly large datasets, such as those encountered in web-scale recommender systems like Netflix and Pandora, the data management aspects quickly become challenging and form a roadblock. In this paper, we introduce a system called Sparkler to solve such large instances of low-rank matrix factorization. Sparkler extends Spark, an existing platform for running parallel iterative algorithms on datasets that fit in the aggregate main memory of a cluster. Sparkler supports distributed stochastic gradient descent, an iterative technique that has been shown to perform very well in practice, as an approach to solving the factorization problem. We identify the shortfalls of Spark in solving large matrix factorization problems, especially when running on the cloud, and address them by introducing a novel abstraction called "Carousel Maps" (CMs). CMs are well suited to storing large matrices in the aggregate memory of a cluster and can efficiently support the operations performed on them during distributed stochastic gradient descent. We describe the design, implementation, and use of CMs in Sparkler programs. Through a variety of experiments, we demonstrate that Sparkler is faster than Spark by 4x to 21x, with bigger advantages for larger problems. Equally important, we show that this can be done without compromising ease of programming. We argue that Sparkler provides a convenient and efficient extension to Spark for solving matrix factorization problems on very large datasets.
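
To make the factorization problem concrete, the following is a minimal single-process sketch of stochastic gradient descent for low-rank matrix factorization. It is not Sparkler's implementation or API, and it does not model Carousel Maps. The function names (`dsgd_factorize`, `predict`, `rmse`), the hyperparameters, the modulo-based block partitioning, and the toy data are all assumptions made for illustration. The sketch assumes the common block-scheduling idea behind distributed SGD: the observed entries are split into a grid of blocks, and blocks that share no rows or columns touch disjoint parts of the two factor matrices, so they could be processed on different workers at the same time. Here they are simply processed one after another.

```python
import random


def predict(W, H, i, j):
    """Dot product of row factor W[i] and column factor H[j]."""
    return sum(w * h for w, h in zip(W[i], H[j]))


def dsgd_factorize(ratings, n_rows, n_cols, rank=2, n_blocks=2,
                   epochs=200, lr=0.05, reg=0.01, seed=0):
    """Factor a sparse matrix given as {(i, j): value} into W and H.

    Hypothetical helper: all parameter names and defaults are illustrative.
    """
    rng = random.Random(seed)
    W = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_cols)]

    # Assumed partitioning: assign each observed entry to a grid block
    # by taking its row and column index modulo n_blocks.
    blocks = {}
    for (i, j), v in ratings.items():
        blocks.setdefault((i % n_blocks, j % n_blocks), []).append((i, j, v))

    for _ in range(epochs):
        for shift in range(n_blocks):
            # The blocks (b, (b + shift) % n_blocks) for all b share no rows
            # or columns, so a distributed version could run them in parallel.
            for b in range(n_blocks):
                entries = blocks.get((b, (b + shift) % n_blocks), [])
                rng.shuffle(entries)
                for i, j, v in entries:
                    err = v - predict(W, H, i, j)
                    for k in range(rank):
                        w, h = W[i][k], H[j][k]
                        # Gradient step on squared error with L2 regularization.
                        W[i][k] += lr * (err * h - reg * w)
                        H[j][k] += lr * (err * w - reg * h)
    return W, H


def rmse(ratings, W, H):
    """Root-mean-square error over the observed entries."""
    sq = [(v - predict(W, H, i, j)) ** 2 for (i, j), v in ratings.items()]
    return (sum(sq) / len(sq)) ** 0.5


if __name__ == "__main__":
    # Toy rank-1 matrix with two entries held out as "missing".
    u, v = [1.0, 2.0, 1.5, 0.5], [1.0, 2.0, 0.5]
    held_out = {(0, 2), (3, 1)}
    ratings = {(i, j): u[i] * v[j]
               for i in range(4) for j in range(3) if (i, j) not in held_out}
    W, H = dsgd_factorize(ratings, n_rows=4, n_cols=3)
    print("training RMSE:", rmse(ratings, W, H))
    print("predicted held-out entry (0, 2):", predict(W, H, 0, 2))
```

In this sketch, the column-factor partition that each row block works on changes with every `shift`. A distributed implementation would have to move factor data between workers at that point, which is the kind of data management cost the abstract refers to.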