MAD skills: new analysis practices for big data

Authors:
Jeffrey Cohen;Brian Dolan;Mark Dunlap;Joseph M. Hellerstein;Caleb Welton
Affiliations:
Greenplum;Fox Audience Network;Evergreen Technologies;U. C. Berkeley;Greenplum
Venue:
Proceedings of the VLDB Endowment
Year:
2009

Citing 13
Cited 44

Encapsulation of parallelism in the Volcano query processing system

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Loading databases using dataflow parallelism

ACM SIGMOD Record
Online aggregation

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Designing and mining multi-terabyte astronomy archives: the Sloan Digital Sky Survey

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals

Data Mining and Knowledge Discovery
Inclusion of New Types in Relational Data Base Systems

Proceedings of the Second International Conference on Data Engineering
Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
ZOO: A Desktop Experiment Management Environment

VLDB '96 Proceedings of the 22nd International Conference on Very Large Data Bases
Building the Data Warehouse

Building the Data Warehouse
C-store: a column-oriented DBMS

VLDB '05 Proceedings of the 31st international conference on Very large data bases
From databases to dataspaces: a new abstraction for information management

ACM SIGMOD Record
Web Analytics: An Hour a Day

Web Analytics: An Hour a Day
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th Symposium on Operating Systems Design & Implementation - Volume 6

Splash: ad-hoc querying of data and statistical models

Proceedings of the 13th International Conference on Extending Database Technology
Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling

Data & Knowledge Engineering
Beyond online aggregation: parallel and incremental data mining with online Map-Reduce

Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
ERACER: a database approach for statistical inference and data cleaning

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Ricardo: integrating R and Hadoop

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Scalable clustering algorithm for N-body simulations in a shared-nothing cluster

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
MRShare: sharing across multiple queries in MapReduce

Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Proceedings of the VLDB Endowment
MCDB-R: risk analysis in the database

Proceedings of the VLDB Endowment
Big data and cloud computing: current state and future opportunities

Proceedings of the 14th International Conference on Extending Database Technology
Hybrid merge/overlap execution technique for parallel array processing

Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases
ArrayStore: a storage manager for complex parallel array processing

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient processing of data warehousing queries in a split execution environment

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
The monte carlo database system: Stochastic analysis close to the data

ACM Transactions on Database Systems (TODS)
Massively parallel in-database predictions using PMML

Proceedings of the 2011 workshop on Predictive markup language modeling
The architecture of SciDB

SSDBM'11 Proceedings of the 23rd international conference on Scientific and statistical database management
Analytics over large-scale multidimensional data: the big data revolution!

Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP
A call to arms: revisiting database design

ACM SIGMOD Record
Building wavelet histograms on large data in MapReduce

Proceedings of the VLDB Endowment
GLADE: a scalable framework for efficient analytics

ACM SIGOPS Operating Systems Review
Approximate computation and implicit regularization for very large-scale data analysis

PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
NoDB: efficient query execution on raw data files

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Towards a unified architecture for in-RDBMS analytics

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
GLADE: big data analytics made easy

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Large-scale machine learning at twitter

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Sample-based forecasting exploiting hierarchical time series

Proceedings of the 16th International Database Engineering & Applications Symposium
Scaling pair-wise similarity-based algorithms in tagging spaces

ICWE'12 Proceedings of the 12th international conference on Web Engineering
The MADlib analytics library: or MAD skills, the SQL

Proceedings of the VLDB Endowment
An integrated multidimensional modeling approach to access big data in business intelligence platforms

ER'12 Proceedings of the 2012 international conference on Advances in Conceptual Modeling
Predictive analytics with surveillance big data

Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data
Cumulon: optimizing statistical data analysis in the cloud

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Shark: SQL and rich analytics at scale

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Scaling big data mining infrastructure: the twitter experience

ACM SIGKDD Explorations Newsletter
Towards a workload for evolutionary analytics

Proceedings of the Second Workshop on Data Analytics in the Cloud
GPText: Greenplum parallel statistical text analysis framework

Proceedings of the Second Workshop on Data Analytics in the Cloud
Knowledge discovery from massive healthcare claims data

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
pEDM: online-forecasting for smart energy analytics

Proceedings of the 22nd ACM international conference on Information & knowledge management
Clustering cubes with binary dimensions in one pass

Proceedings of the sixteenth international workshop on Data warehousing and OLAP
Can we analyze big data inside a DBMS?

Proceedings of the sixteenth international workshop on Data warehousing and OLAP
Data warehousing and OLAP over big data: current challenges and future research directions

Proceedings of the sixteenth international workshop on Data warehousing and OLAP
Big data: a research agenda

Proceedings of the 17th International Database Engineering & Applications Symposium
PREDIcT: towards predicting the runtime of large scale iterative analytics

Proceedings of the VLDB Endowment
On the distribution of the second-largest latent root for certain high dimensional Wishart matrices

International Journal of Knowledge Engineering and Soft Data Paradigms
Creating a model of the dynamics of socio-technical groups

User Modeling and User-Adapted Interaction

Abstract

As massive data acquisition and storage become increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world's largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.