You can stop early with COLA: online processing of aggregate queries in the cloud

Authors:
Yingjie Shi;Xiaofeng Meng;Fusheng Wang;Yantao Gan
Affiliations:
Renmin University of China, Beijing, China;Renmin University of China, Beijing, China;Emory University, Atlanta, GA, USA;Renmin University of China, Beijing, China
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Abstract

Cloud-based data management systems are emerging as scalable, fault-tolerant, and efficient solutions for managing large volumes of data on cost-effective infrastructure, and a growing number of data analysis applications are migrating to the cloud. Online processing of aggregate queries is an attractive way to provide a quick sketch of massive data well before the final, exact query result is available, which makes it of paramount importance in the cloud. The problem is challenging because data in the cloud is organized in large blocks and processed in a distributed fashion. In this paper, we present COLA, a system for Cloud Online Aggregation that provides progressive approximate answers for queries over both single tables and joins of multiple tables. We develop an online query processing algorithm for MapReduce that supports incremental and continuous computation of aggregations on joins and minimizes the waiting time before an acceptable estimate is reached. We formulate a statistical foundation that supports block-level sampling for single-table online aggregation, together with effective estimation of approximate results and confidence intervals of statistical significance. For multi-table aggregations, we develop a two-phase stratified sampling method that improves the approximate query answers and speeds up the convergence of the confidence intervals. We implement COLA in Hadoop, and our experiments demonstrate that COLA can deliver reasonably precise online estimates in a time two orders of magnitude shorter than that needed to produce exact answers.
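To make the idea of block-level sampling with confidence intervals concrete, here is a minimal Python sketch of online aggregation for a single-table SUM. It is not COLA's estimator, which the abstract does not spell out. It assumes a textbook cluster-sampling setup: blocks are visited in uniformly random order, each block's sum is one sample, and the interval comes from a normal approximation with a finite-population correction. The names `online_sum_estimates` and `stop_early` are hypothetical and introduced only for illustration.

```python
import math
import random
from statistics import NormalDist


def online_sum_estimates(blocks, confidence=0.95, seed=0):
    """Yield (blocks_seen, estimate, half_width) after each sampled block.

    Assumption (illustrative, not taken from the paper): blocks are drawn
    by simple random sampling without replacement, and the per-block sum
    is the sampling unit.
    """
    if not blocks:
        raise ValueError("blocks must be non-empty")

    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)  # random block order stands in for block-level sampling

    total_blocks = len(blocks)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided normal quantile

    n = 0
    mean = 0.0  # running mean of block sums
    m2 = 0.0    # running sum of squared deviations (Welford's method)
    for idx in order:
        block_sum = float(sum(blocks[idx]))
        n += 1
        delta = block_sum - mean
        mean += delta / n
        m2 += delta * (block_sum - mean)

        # Scale the mean block sum up to the whole table.
        estimate = total_blocks * mean
        if n > 1:
            variance = m2 / (n - 1)
            fpc = 1.0 - n / total_blocks  # finite-population correction
            half_width = z * total_blocks * math.sqrt(variance / n * fpc)
        else:
            half_width = math.inf  # no variance estimate from one block
        yield n, estimate, half_width


def stop_early(blocks, rel_error=0.05, **kwargs):
    """Return the first estimate whose relative half-width is small enough."""
    result = None
    for result in online_sum_estimates(blocks, **kwargs):
        _, estimate, half_width = result
        if estimate != 0 and half_width / abs(estimate) <= rel_error:
            break
    return result
```

Calling `stop_early(blocks, rel_error=0.05)` returns as soon as the interval half-width falls within 5% of the running estimate, which is the "stop early" behaviour the title alludes to. If every block is read, the correction term drives the half-width to zero and the estimate equals the exact sum up to floating-point rounding.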