You can stop early with COLA: online processing of aggregate queries in the cloud

  • Authors:
  • Yingjie Shi;Xiaofeng Meng;Fusheng Wang;Yantao Gan

  • Affiliations:
  • Renmin University of China, Beijing, China;Renmin University of China, Beijing, China;Emory University, Atlanta, GA, USA;Renmin University of China, Beijing, China

  • Venue:
  • Proceedings of the 21st ACM international conference on Information and knowledge management
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Cloud-based data management systems are emerging as scalable, fault-tolerant, and efficient solutions to manage large volumes of data with cost effective infrastructures, and more and more data analysis applications are migrated to the cloud. As an attractive solution to provide a quick sketch of massive data before a long wait of the final accurate query result, online processing of aggregate queries in the cloud is of paramount importance. This problem is challenging to solve because of the large block based data organization and distributed processing mode in the cloud. In this paper, we present COLA, a system for Cloud Online Aggregation to provide progressive approximate answers for both single tables and joined multiple tables. We develop an online query processing algorithm for MapReduce to support incremental and continuous computing of aggregations on joins which minimizes the waiting time before an acceptable estimate is achieved. We formulate a statistical foundation that supports block-level sampling for single-table online aggregations and effective estimation of approximate results and confidence intervals of statistical significance. We also develop a two-phase stratified sampling method to support multi-table aggregations to improve the approximate query answers and speed up the convergence of confidence intervals. We implement COLA in Hadoop, and our experiments demonstrate that COLA can deliver reasonable precise online estimates within a time period two orders of magnitude shorter than that used to produce exact answers.