Sparkler: supporting large-scale matrix factorization

Authors:
Boduo Li;Sandeep Tata;Yannis Sismanis
Affiliations:
University of Massachusetts Amherst, Amherst, MA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA
Venue:
Proceedings of the 16th International Conference on Extending Database Technology
Year:
2013

Citing 31
Cited 0

Probabilistic latent semantic indexing

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Chord: A scalable peer-to-peer lookup service for internet applications

Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
Guest Editors' Introduction: The Top 10 Algorithms

Computing in Science and Engineering
The Decompositional Approach to Matrix Computation

Computing in Science and Engineering
PerDiS: Design, Implementation, and Use of a PERsistent DIstributed Store

Advances in Distributed Systems, Advanced Distributed Computing: From Algorithms to Systems
Google news personalization: scalable online collaborative filtering

Proceedings of the 16th international conference on World Wide Web
On synopses for distinct-value estimation under multiset operations

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters

Communications of the ACM - 50th anniversary issue: 1958 - 2008
A Unified View of Matrix Factorization Models

ECML PKDD '08 Proceedings of the European conference on Machine Learning and Knowledge Discovery in Databases - Part II
FlexRecs: expressing and combining flexible recommendations

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Matrix Factorization Techniques for Recommender Systems

Computer
Scalable proximity estimation and link prediction in online social networks

Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference
A survey of collaborative filtering techniques

Advances in Artificial Intelligence
Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce

Proceedings of the 19th international conference on World wide web
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing

Proceedings of the 1st ACM symposium on Cloud computing
Pregel: a system for large-scale graph processing

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Twister: a runtime for iterative MapReduce

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Matrix Completion from Noisy Entries

The Journal of Machine Learning Research
Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs

GPCE '10 Proceedings of the ninth international conference on Generative programming and component engineering
Decomposing background topics from keywords by principal component pursuit

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
HaLoop: efficient iterative data processing on large clusters

Proceedings of the VLDB Endowment
Piccolo: building fast, distributed programs with partitioned tables

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Mesos: a platform for fine-grained resource sharing in the data center

Proceedings of the 8th USENIX conference on Networked systems design and implementation
SystemML: Declarative machine learning on MapReduce

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Hyracks: A flexible and extensible foundation for data-intensive computing

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Managing data transfers in computer clusters with orchestra

Proceedings of the ACM SIGCOMM 2011 conference
Large-scale matrix factorization with distributed stochastic gradient descent

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
MadLINQ: large-scale distributed matrix computation for the cloud

Proceedings of the 7th ACM european conference on Computer Systems
Towards a unified architecture for in-RDBMS analytics

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Using R for iterative and incremental processing

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Low-rank matrix factorization has recently been applied with great success on matrix completion problems for applications like recommendation systems, link predictions for social networks, and click prediction for web search. However, as this approach is applied to increasingly larger datasets, such as those encountered in web-scale recommender systems like Netflix and Pandora, the data management aspects quickly become challenging and form a road-block. In this paper, we introduce a system called Sparkler to solve such large instances of low rank matrix factorizations. Sparkler extends Spark, an existing platform for running parallel iterative algorithms on datasets that fit in the aggregate main memory of a cluster. Sparkler supports distributed stochastic gradient descent as an approach to solving the factorization problem -- an iterative technique that has been shown to perform very well in practice. We identify the shortfalls of Spark in solving large matrix factorization problems, especially when running on the cloud, and solve this by introducing a novel abstraction called "Carousel Maps" (CMs). CMs are well suited to storing large matrices in the aggregate memory of a cluster and can efficiently support the operations performed on them during distributed stochastic gradient descent. We describe the design, implementation, and the use of CMs in Sparkler programs. Through a variety of experiments, we demonstrate that Sparkler is faster than Spark by 4x to 21x, with bigger advantages for larger problems. Equally importantly, we show that this can be done without imposing any changes to the ease of programming. We argue that Sparkler provides a convenient and efficient extension to Spark for solving matrix factorization problems on very large datasets.