Towards high-performance implementations of a custom HPC kernel using Intel® Array Building Blocks

Authors:
Alexander Heinecke; Michael Klemm; Hans Pabst; Dirk Pflüger
Affiliations:
Technische Universität München, Garching, Germany; Intel GmbH, Feldkirchen, Germany; Intel GmbH, Feldkirchen, Germany; Technische Universität München, Garching, Germany
Venue:
Facing the Multicore-Challenge II
Year:
2012

Abstract

Today's highly parallel machines drive a new demand for parallel programming. Fixed power envelopes, increasing problem sizes, and new algorithms pose challenging targets for developers. HPC applications must leverage SIMD units, multi-core architectures, and heterogeneous computing platforms for optimal performance. This leads to low-level, non-portable code that is difficult to write and maintain. With Intel® Array Building Blocks (Intel ArBB), programmers focus on the high-level algorithms and rely on automatic parallelization and vectorization with strong safety guarantees. Intel ArBB hides vendor-specific hardware knowledge through just-in-time (JIT) compilation at runtime. This case study on data mining with adaptive sparse grids shows how deterministic parallelism, safety, and runtime optimization make Intel ArBB practically applicable. Hand-tuned code is about 40% faster than the ArBB implementation but needs about 8x more code. The ArBB implementation clearly outperforms standard semi-automatically parallelized C/C++ code, by approximately 6x.