Big data analytics with small footprint: squaring the cloud

Authors:
John Canny;Huasha Zhao
Affiliations:
University of California, Berkeley, Berkeley, CA, USA;University of California, Berkeley, Berkeley, CA, USA
Venue:
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2013

Citing 10
Cited 0

Collaborative filtering with privacy via factor analysis

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
GaP: a factor model for discrete data

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Pegasos: Primal Estimated sub-GrAdient SOlver for SVM

Proceedings of the 24th international conference on Machine learning
Matrix Factorization Techniques for Recommender Systems

Computer
Twister: a runtime for iterative MapReduce

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
An architecture for parallel topic models

Proceedings of the VLDB Endowment
ASTERIX: towards a scalable, semistructured data platform for evolving-world models

Distributed and Parallel Databases
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
The MADlib analytics library: or MAD skills, the SQL

Proceedings of the VLDB Endowment
PowerGraph: distributed graph-parallel computation on natural graphs

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper describes the BID Data Suite, a collection of hardware, software and design patterns that enable fast, large-scale data mining at very low cost. By co-designing all of these elements we achieve single-machine performance levels that equal or exceed reported cluster implementations for common benchmark problems. A key design criterion is rapid exploration of models, hence the system is interactive and primarily single-user. The elements of the suite are: (i) the data engine, a hardware design pattern that balances storage, CPU and GPU acceleration for typical data mining workloads, (ii) BIDMat, an interactive matrix library that integrates CPU and GPU acceleration and novel computational kernels (iii), BIDMach, a machine learning system that includes very efficient model optimizers, (iv) Butterfly mixing, a communication strategy that hides the latency of frequent model updates needed by fast optimizers and (v) Design patterns to improve performance of iterative update algorithms. We present several benchmark problems to show how the above elements combine to yield multiple orders-of-magnitude improvements for each problem.