Scalable progressive analytics on big data in the cloud

Authors:
Badrish Chandramouli;Jonathan Goldstein;Abdul Quamar
Affiliations:
Microsoft Research, Redmond;Microsoft Research, Redmond;University of Maryland, College Park
Venue:
Proceedings of the VLDB Endowment
Year:
2013

Citing 23
Cited 0

Improving Generalization with Active Learning

Machine Learning - Special issue on structured connectionist systems
Online aggregation

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
On random sampling over joins

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Ripple joins for online aggregation

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Informix under CONTROL: Online Query Processing

Data Mining and Knowledge Discovery
Online Dynamic Reordering for Interactive Data Processing

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Effective use of block-level sampling in statistics estimation

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Revision Processing in a Stream Processing Engine: A High-Level Design

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Scalable approximate query processing with the DBO engine

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Hadoop: The Definitive Guide

Hadoop: The Definitive Guide
MapReduce online

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Dremel: interactive analysis of web-scale datasets

Proceedings of the VLDB Endowment
A latency and fault-tolerance optimizer for online parallel query plans

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A platform for scalable one-pass analytics using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Nobody ever got fired for using Hadoop on a cluster

Proceedings of the 1st International Workshop on Hot Topics in Cloud Data Processing
SkewTune: mitigating skew in mapreduce applications

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Temporal Analytics on Big Data for Web Advertising

ICDE '12 Proceedings of the 2012 IEEE 28th International Conference on Data Engineering
Early accurate results for advanced analytics on MapReduce

Proceedings of the VLDB Endowment
Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing
Processing a trillion cells per mouse click

Proceedings of the VLDB Endowment
BlinkDB: queries with bounded errors and bounded response times on very large data

Proceedings of the 8th ACM European Conference on Computer Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Analytics over the increasing quantity of data stored in the Cloud has become very expensive, particularly due to the pay-as-you-go Cloud computation model. Data scientists typically manually extract samples of increasing data size (progressive samples) using domain-specific sampling strategies for exploratory querying. This provides them with user-control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. We propose a new progressive analytics system based on a progress model called Prism that (1) allows users to communicate progressive samples to the system; (2) allows efficient and deterministic query processing over samples; and (3) provides repeatable semantics and provenance to data scientists. We show that one can realize this model for atemporal relational queries using an unmodified temporal streaming engine, by re-interpreting temporal event fields to denote progress. Based on Prism, we build Now!, a progressive data-parallel computation framework for Windows Azure, where progress is understood as a first-class citizen in the framework. Now! works with "progress-aware reducers"- in particular, it works with streaming engines to support progressive SQL over big data. Extensive experiments on Windows Azure with real and synthetic workloads validate the scalability and benefits of Now! and its optimizations, over current solutions for progressive analytics.