Improving Generalization with Active Learning
Machine Learning - Special issue on structured connectionist systems
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Ripple joins for online aggregation
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Informix under CONTROL: Online Query Processing
Data Mining and Knowledge Discovery
Online Dynamic Reordering for Interactive Data Processing
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Effective use of block-level sampling in statistics estimation
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Revision Processing in a Stream Processing Engine: A High-Level Design
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Scalable approximate query processing with the DBO engine
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Hadoop: The Definitive Guide
NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Dremel: interactive analysis of web-scale datasets
Proceedings of the VLDB Endowment
A latency and fault-tolerance optimizer for online parallel query plans
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A platform for scalable one-pass analytics using MapReduce
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Nobody ever got fired for using Hadoop on a cluster
Proceedings of the 1st International Workshop on Hot Topics in Cloud Data Processing
SkewTune: mitigating skew in mapreduce applications
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Temporal Analytics on Big Data for Web Advertising
ICDE '12 Proceedings of the 2012 IEEE 28th International Conference on Data Engineering
Early accurate results for advanced analytics on MapReduce
Proceedings of the VLDB Endowment
Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters
HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing
Processing a trillion cells per mouse click
Proceedings of the VLDB Endowment
BlinkDB: queries with bounded errors and bounded response times on very large data
Proceedings of the 8th ACM European Conference on Computer Systems
Hi-index | 0.00 |
Analytics over the increasing quantity of data stored in the Cloud has become very expensive, particularly due to the pay-as-you-go Cloud computation model. Data scientists typically manually extract samples of increasing data size (progressive samples) using domain-specific sampling strategies for exploratory querying. This provides them with user-control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. We propose a new progressive analytics system based on a progress model called Prism that (1) allows users to communicate progressive samples to the system; (2) allows efficient and deterministic query processing over samples; and (3) provides repeatable semantics and provenance to data scientists. We show that one can realize this model for atemporal relational queries using an unmodified temporal streaming engine, by re-interpreting temporal event fields to denote progress. Based on Prism, we build Now!, a progressive data-parallel computation framework for Windows Azure, where progress is understood as a first-class citizen in the framework. Now! works with "progress-aware reducers"- in particular, it works with streaming engines to support progressive SQL over big data. Extensive experiments on Windows Azure with real and synthetic workloads validate the scalability and benefits of Now! and its optimizations, over current solutions for progressive analytics.