Improving online aggregation performance for skewed data distribution

Authors:
Yuxiang Wang;Junzhou Luo;Aibo Song;Jiahui Jin;Fang Dong
Affiliations:
School of Computer Science and Engineering, Southeast University, Nanjing, P.R. China;School of Computer Science and Engineering, Southeast University, Nanjing, P.R. China;School of Computer Science and Engineering, Southeast University, Nanjing, P.R. China;School of Computer Science and Engineering, Southeast University, Nanjing, P.R. China;School of Computer Science and Engineering, Southeast University, Nanjing, P.R. China
Venue:
DASFAA'12 Proceedings of the 17th international conference on Database Systems for Advanced Applications - Volume Part I
Year:
2012

Abstract

Online aggregation is a commonly used technique for answering aggregation queries quickly with approximate answers that are progressively refined, each reported within an estimated confidence interval. However, we observe that low selectivity and an inappropriate sample proportion significantly degrade online aggregation performance when the data distribution is skewed. To overcome this problem, we propose a Partition-based Online Aggregation System called POAS. In POAS, partition and shuffle strategies reduce the side effect of low selectivity by efficiently pruning unneeded data, and an appropriate sample proportion is approached as closely as possible by drawing samples (tuples) from the relevant partitions with a dynamic sample size. Moreover, POAS applies statistical approaches to calculate estimates from the relevant partitions. We have implemented POAS and conducted an extensive experimental study on the TPC-H benchmark with skewed data distributions. Our results demonstrate the efficiency and effectiveness of POAS.
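
To make the general idea concrete, the following is a minimal sketch of online aggregation over partitioned data. It is not the POAS implementation, whose partition, shuffle, and dynamic sample-size strategies are not specified in the abstract. The function name online_sum, the fixed per-round batch size, and the use of a textbook stratified estimator with a normal-approximation confidence interval are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def online_sum(partitions, predicate, value, batch=100, confidence=0.95, seed=0):
    """Yield (estimate, half_width) for SUM(value(t)) WHERE predicate(t).

    Hypothetical sketch: each partition is treated as a stratum, sampled
    without replacement, and the estimate is refined after every round.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Shuffle each partition once, so any prefix is a uniform random sample.
    shuffled = [rng.sample(p, len(p)) for p in partitions]
    taken = [0] * len(shuffled)

    while any(n < len(p) for n, p in zip(taken, shuffled)):
        # Assumption: a fixed batch per partition per round (not dynamic).
        taken = [min(len(p), n + batch) for n, p in zip(taken, shuffled)]
        estimate, variance = 0.0, 0.0
        for p, n in zip(shuffled, taken):
            size = len(p)
            if n == 0:
                continue
            # Tuples that fail the predicate contribute zero to the sum.
            xs = [value(t) if predicate(t) else 0.0 for t in p[:n]]
            mean = sum(xs) / n
            estimate += size * mean
            if n > 1:
                s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)
                # Finite-population correction: variance vanishes when n == size.
                variance += size * size * (s2 / n) * (1 - n / size)
        yield estimate, z * math.sqrt(variance)


if __name__ == "__main__":
    rng = random.Random(1)
    # A skewed toy data set: one partition holds most of the large values.
    data = [[rng.randint(0, 10) for _ in range(1000)],
            [rng.randint(0, 1000) for _ in range(1000)]]
    for est, hw in online_sum(data, lambda t: t > 5, float, batch=250):
        print(f"estimate = {est:.1f} +/- {hw:.1f}")
```

Each round prints a running estimate with its half-width, and the interval shrinks as more tuples are read. With low selectivity most sampled tuples contribute zero, which inflates the variance; this is the effect the abstract attributes to skewed data. A system that skips partitions containing no qualifying tuples would avoid spending samples on them.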