Revisiting aggregation techniques for big data

  • Authors:
  • Vassilis J. Tsotras

  • Affiliations:
  • University of California, Riverside, Riverside, CA, USA

  • Venue:
  • Proceedings of the Sixteenth International Workshop on Data Warehousing and OLAP
  • Year:
  • 2013

Abstract

In this talk we first present an introduction to AsterixDB [1], a parallel, semi-structured platform to ingest, store, index, query, analyze, and publish "big data" (http://asterixdb.ics.uci.edu), and the various challenges we addressed while building it. AsterixDB combines ideas from semi-structured data management, parallel database systems, and first-generation data-intensive computing platforms (MapReduce and Hadoop). The full AsterixDB software stack provides support for big data applications, from the storage and processing engine (Hyracks [2], available at http://hyracks.googlecode.com), to the flexible query optimization layer (Algebricks), to the interfaces for user-level interaction (AQL, HiveQL, Pregelix, etc.). Hyracks is a partitioned-parallel engine for data-intensive computing jobs expressed as DAGs. Algebricks is a model-agnostic, algebraic layer for compiling and optimizing parallel queries to be processed by Hyracks. Queries for AsterixDB can be expressed either in popular higher-level data analysis languages such as Pig, Hive, or Jaql, or in its native query language (AQL) and data model (ADM), which support semi-structured information and fuzzy data. Fundamental data processing operations, such as joins and aggregations, are natively supported in AsterixDB.

The second part of the talk focuses on our experiences while designing efficient local (per-node) aggregation algorithms for AsterixDB. Local aggregation in a big data system faces two particular challenges: first, if the aggregation is group-based (like the "group-by" in SQL), the aggregation result may not fit in main memory; second, to allow multiple operations to be processed simultaneously, an aggregation operation must work within a strict memory budget provided by the platform. Despite its importance and these challenges, the design and evaluation of local aggregation algorithms have not received the same level of attention in the literature as other basic operators, such as joins. Facing a lack of "off-the-shelf" local aggregation algorithms for big data, we present low-level implementation details for engineering the aggregation operator, utilizing (i) sort-based, (ii) hash-based, and (iii) sort-hash-hybrid approaches. We present six algorithms, all of which work within a strictly bounded memory budget and can easily adapt between in-memory and external processing. Among them, two are novel and four are based on extending existing join algorithms. We deployed all algorithms as operators in the Hyracks platform and evaluated their performance through extensive experimentation. Our experiments cover many different performance factors, including input cardinality, memory, data distribution, and hash table structure. Our study guided our selection of the local aggregation algorithms supported in the recent release of AsterixDB, namely: the hybrid-hash Pre-Partitioning algorithm, for its tolerance to inaccurate estimates of the input grouping-key cardinality; the Hash-Sort algorithm, for its good performance when aggregating skewed data; and the Sort-Based algorithm, for input data that is already sorted.

This local aggregation work is the first part of a two-part big data aggregation study, as it addresses the "map" phase. Our findings provide the foundation for the global aggregation strategy we are currently investigating for the "reduce" phase. We hope our experience can help developers of other big data platforms build a solid local aggregation operator.
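
As a rough illustration of the memory-budget constraint discussed above (and not the actual AsterixDB/Hyracks operator code), the following Java sketch shows a hash-based group-by that sorts and spills its partial aggregates whenever the in-memory table would exceed its budget, in the spirit of the Hash-Sort approach. The class name, the per-entry footprint estimate, and the simplified in-memory "run" storage are all illustrative assumptions.

```java
import java.util.*;

// Hypothetical sketch of a memory-bounded, group-based aggregation (SUM per key).
// When adding a new group would exceed the budget, the current partial results
// are sorted by key and spilled as a run, then aggregation continues.
public class BoundedHashAggregator {
    private final long memoryBudgetBytes;                         // budget handed down by the platform
    private final Map<String, Long> table = new HashMap<>();      // in-memory partial aggregates
    private final List<SortedMap<String, Long>> spilledRuns = new ArrayList<>();
    private long usedBytes = 0;

    public BoundedHashAggregator(long memoryBudgetBytes) {
        this.memoryBudgetBytes = memoryBudgetBytes;
    }

    /** Aggregate one (key, value) record. */
    public void add(String key, long value) {
        boolean isNewGroup = !table.containsKey(key);
        if (isNewGroup && usedBytes + estimateEntryBytes(key) > memoryBudgetBytes) {
            spillCurrentTable();                                   // make room before inserting
        }
        table.merge(key, value, Long::sum);
        if (isNewGroup) {
            usedBytes += estimateEntryBytes(key);
        }
    }

    private void spillCurrentTable() {
        if (table.isEmpty()) {
            return;
        }
        // Sort the partial aggregates by key so spilled runs can later be combined
        // with a merge (the "sort" part of Hash-Sort). A real operator would write
        // the run to a file instead of keeping it in memory.
        spilledRuns.add(new TreeMap<>(table));
        table.clear();
        usedBytes = 0;
    }

    /** Combine the in-memory table with all spilled runs into the final result. */
    public SortedMap<String, Long> finish() {
        spillCurrentTable();
        // Simplified final merge: a real operator would stream-merge the sorted
        // runs under the same memory budget instead of materializing everything.
        SortedMap<String, Long> result = new TreeMap<>();
        for (SortedMap<String, Long> run : spilledRuns) {
            run.forEach((k, v) -> result.merge(k, v, Long::sum));
        }
        return result;
    }

    private long estimateEntryBytes(String key) {
        return 48 + 2L * key.length();                             // crude per-entry footprint estimate
    }

    public static void main(String[] args) {
        BoundedHashAggregator agg = new BoundedHashAggregator(1024); // tiny budget to force spills
        for (String k : new String[] {"a", "b", "a", "c", "b", "a"}) {
            agg.add(k, 1);
        }
        System.out.println(agg.finish());                            // {a=3, b=2, c=1}
    }
}
```

The sketch only aims to show how a hash-based operator can respect an externally imposed memory budget by switching to sorted, spilled runs; the actual algorithms evaluated in this work (including the hybrid-hash Pre-Partitioning variant) involve considerably more careful partitioning and run management.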