Scalable aggregation on multicore processors

Authors:
Yang Ye;Kenneth A. Ross;Norases Vesdapunt
Affiliations:
Columbia University, New York NY;Columbia University, New York NY;Columbia University, New York NY
Venue:
Proceedings of the Seventh International Workshop on Data Management on New Hardware
Year:
2011

Citing 10
Cited 8

Quickly generating billion-record synthetic databases

SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
Adaptive parallel aggregation algorithms

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Database Architecture Optimized for the New Bottleneck: Memory Access

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
DBMSs on a Modern Processor: Where Does Time Go?

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Improving Hash Join Performance through Prefetching

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Improving database performance on simultaneous multithreading processors

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Adaptive aggregation on chip multiprocessors

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Data partitioning on chip multiprocessors

Proceedings of the 4th international workshop on Data management on new hardware
Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Automatic contention detection and amelioration for data-intensive operations

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

A comparison of the use of virtual versus physical snapshots for supporting update-intensive workloads

DaMoN '12 Proceedings of the Eighth International Workshop on Data Management on New Hardware
Ameliorating memory contention of OLAP operators on GPU processors

DaMoN '12 Proceedings of the Eighth International Workshop on Data Management on New Hardware
Efficient frequent item counting in multi-core hardware

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Riposte: a trace-driven compiler and parallel VM for vector code in R

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
High throughput heavy hitter aggregation for modern SIMD processors

Proceedings of the Ninth International Workshop on Data Management on New Hardware
Navigating big data with high-throughput, energy-efficient data partitioning

Proceedings of the 40th Annual International Symposium on Computer Architecture
MacroDB: scaling database engines on multicores

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Adaptive and big data scale parallel execution in oracle

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

In data-intensive and multi-threaded programming, the performance bottleneck has shifted from I/O bandwidth to main memory bandwidth. The availability, size, and other properties of on-chip cache strongly influence performance. A key question is whether to allow different threads to work independently, or whether to coordinate the shared workload among the threads. The independent approach avoids synchronization overhead, but requires resources proportional to the number of threads and thus is not scalable. On the other hand, the shared method suffers from coordination overhead and potential contention. In this paper, we aim to provide a solution to performing in-memory parallel aggregation on the Intel Nehalem architecture. We consider several previously proposed techniques that were evaluated on other architectures, including a hybrid independent/shared method and a method that clones data items automatically when contention is detected. We also propose two algorithms: partition-and-aggregate and PLAT. The PLAT and hybrid methods perform best overall, utilizing the computational power of multiple threads without needing memory proportional to the number of threads, and avoiding much of the coordination overhead and contention apparent in the shared table method.