Towards a unified architecture for in-RDBMS analytics

Authors:
Xixuan Feng;Arun Kumar;Benjamin Recht;Christopher Ré
Affiliations:
University of Wisconsin-Madison, Madison, WI, USA;University of Wisconsin-Madison, Madison, WI, USA;University of Wisconsin-Madison, Madison, WI, USA;University of Wisconsin-Madison, Madison, WI, USA
Venue:
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Year:
2012

Abstract

The increasing use of statistical data analysis in enterprise applications has created an arms race among database vendors to offer ever more sophisticated in-database analytics. One challenge in this race is that each new statistical technique must be implemented from scratch in the RDBMS, which leads to a lengthy and complex development process. We argue that the root cause of this overhead is the lack of a unified architecture for in-database analytics. Our main contribution in this work is to take a step towards such a unified architecture. A key benefit of our unified architecture is that performance optimizations for analytics techniques can be studied generically rather than in an ad hoc, per-technique fashion. In particular, our technical contributions are theoretical and empirical studies of two key factors that we found to impact performance: the order in which data is stored, and the parallelization of computations on a single-node multicore RDBMS. We demonstrate the feasibility of our architecture by integrating several popular analytics techniques into two commercial RDBMSes and one open-source RDBMS. Integrating a new statistical technique into our architecture requires changes to only a few dozen lines of code. We then compare our approach with the native analytics tools offered by the commercial RDBMSes on various analytics tasks, and validate that our approach achieves competitive or higher performance while achieving the same quality.
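
As a rough illustration of what a shared abstraction for analytics techniques might look like, the sketch below assumes a technique can be phrased as an aggregate-style computation with `initialize`, `transition`, and `terminate` functions applied tuple by tuple, and uses incremental gradient descent on a least-squares objective as the example technique. The abstract does not state this design; every function name, the toy data, and the `shuffle` flag (a hypothetical knob for experimenting with data order) are assumptions made for illustration only.

```python
import random


def initialize(dim):
    """Aggregate start state: a zero model vector (hypothetical interface)."""
    return [0.0] * dim


def transition(w, x, y, step):
    """Fold one tuple (x, y) into the model with a single gradient step
    on the squared error 0.5 * (w . x - y)^2."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - step * err * xi for wi, xi in zip(w, x)]


def terminate(w):
    """Aggregate finalizer: return the model unchanged."""
    return w


def run_epochs(rows, dim, epochs=200, step=0.1, shuffle=True, seed=0):
    """Scan the table `epochs` times, optionally reshuffling the scan order
    before each pass (assumption: order is controlled by the caller)."""
    rng = random.Random(seed)
    rows = list(rows)
    w = initialize(dim)
    for _ in range(epochs):
        if shuffle:
            rng.shuffle(rows)
        for x, y in rows:
            w = transition(w, x, y, step)
    return terminate(w)


# Toy noise-free table generated from y = 2*x1 - 1*x2.
rows = [
    ([1.0, 0.0], 2.0),
    ([0.0, 1.0], -1.0),
    ([1.0, 1.0], 1.0),
    ([0.5, 0.5], 0.5),
]
model = run_epochs(rows, dim=2)
```

In this shape, swapping in a different technique would only mean replacing the body of `transition` (and possibly the state in `initialize`), while the scan loop, ordering, and any parallelization stay shared.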