Memory-efficient groupby-aggregate using compressed buffer trees

Authors:
Hrishikesh Amur; Wolfgang Richter; David G. Andersen; Michael Kaminsky; Karsten Schwan; Athula Balachandran; Erik Zawadzki
Affiliations:
Georgia Institute of Technology; Carnegie Mellon University; Carnegie Mellon University; Intel Labs; Georgia Institute of Technology; Carnegie Mellon University; Carnegie Mellon University
Venue:
Proceedings of the 4th annual Symposium on Cloud Computing
Year:
2013

Abstract

The rapid growth of fast analytics systems that require data processing in memory makes memory capacity an increasingly precious resource. This paper introduces a new compressed data structure called a Compressed Buffer Tree (CBT). Using a combination of techniques including buffering, compression, and serialization, CBTs improve the memory efficiency and performance of the GroupBy-Aggregate abstraction, which forms the basis not only of batch-processing models like MapReduce but also of recent fast analytics systems. For streaming workloads, aggregation using the CBT uses 21–42% less memory than aggregation using Google SparseHash, with up to 16% better throughput. The CBT is also compared to the batch-mode aggregators in the MapReduce runtimes Phoenix++ and Metis: it consumes 4x and 5x less memory, with 1.5–2x and 3–4x higher performance, respectively.
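
To make the GroupBy-Aggregate abstraction concrete, the following is a minimal Python sketch of an aggregator that buffers incoming pairs and keeps its partial aggregates serialized and compressed between flushes. It is a toy, single-level illustration and not the CBT described in the paper: the class name, the `buffer_limit` parameter, the sum aggregate, and the choice of `pickle` and `zlib` are all assumptions made for this example.

```python
import pickle
import zlib


class CompressedBufferAggregator:
    """Toy groupby-sum aggregator (hypothetical; NOT the paper's CBT).

    Recent (key, value) pairs sit in a small uncompressed buffer. When the
    buffer fills, they are merged into the partial aggregates, which are
    then serialized and compressed again.
    """

    def __init__(self, buffer_limit=1000):
        self.buffer_limit = buffer_limit  # assumed tuning knob
        self.buffer = []                  # uncompressed recent pairs
        self.compressed = None            # compressed, serialized partial sums

    def insert(self, key, value):
        # Buffering: inserts are cheap appends until the buffer is full.
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_limit:
            self._flush()

    def _load(self):
        # Decompress and deserialize the partial aggregates, if any exist.
        if self.compressed is None:
            return {}
        return pickle.loads(zlib.decompress(self.compressed))

    def _flush(self):
        # Merge buffered pairs into the aggregates, then recompress.
        totals = self._load()
        for key, value in self.buffer:
            totals[key] = totals.get(key, 0) + value
        self.buffer.clear()
        self.compressed = zlib.compress(pickle.dumps(totals))

    def finalize(self):
        # Flush whatever is still buffered and return the final aggregates.
        self._flush()
        return self._load()


# Usage: a word count over a tiny stream, with a deliberately small buffer.
agg = CompressedBufferAggregator(buffer_limit=4)
for word in "a b a c b a".split():
    agg.insert(word, 1)
result = agg.finalize()
print(result)
```

In this sketch the aggregates are decompressed only when a full buffer is merged, so the decompression cost is amortized over `buffer_limit` inserts; this is the general trade-off between buffering and compression that the abstract alludes to, shown here in simplified form.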