This paper presents a framework for characterizing the distribution of fine-grained parallelism, data movement, and communication-minimizing code partitions. Understanding the spectrum of parallelism available in applications, and how much data movement would result if that parallelism were exploited, is essential to the hardware design process, because these properties will limit the performance scaling of future computing systems. The framework is applied to characterizing 26 applications and kernels, classified according to their dominant components in the Berkeley dwarf/computational-motif classification. The distributions of ILP and TLP over execution time are studied, showing that although mean ILP is high, the available ILP is significantly smaller for most of the execution. The results from this framework are complemented by hardware performance-counter data from two RISC platforms (IBM Power7 and Freescale P2020) and one CISC platform (Intel Atom D510), spanning a broad range of real-machine characteristics. Employing a combination of these new techniques, and building upon previous proposals, it is demonstrated that available ideal-case parallelism and data movement show only limited similarity within and across the dwarf classes.
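To illustrate the kind of ideal-case ILP measurement such limit studies perform, the following is a minimal, hypothetical sketch: given a dynamic instruction trace, it schedules each instruction one cycle after its latest producer (infinite resources, perfect control flow, only true data dependences constrain issue) and reports ILP as trace length over dataflow critical-path depth. The trace format and register names are illustrative assumptions, not the paper's actual tooling.

```python
def ideal_ilp(trace):
    """Ideal-case ILP of a dynamic trace under the dataflow limit.

    trace: list of (dest_reg, [src_regs]) in dynamic program order.
    Each instruction issues one cycle after its latest producer.
    """
    ready = {}   # register -> cycle at which its value becomes available
    depth = 0    # length of the dataflow critical path, in cycles
    for dest, srcs in trace:
        cycle = 1 + max((ready.get(r, 0) for r in srcs), default=0)
        ready[dest] = cycle
        depth = max(depth, cycle)
    return len(trace) / depth if depth else 0.0

# Toy trace: r3 depends on r1 and r2; r4 is independent; r5 joins r3 and r4.
trace = [("r1", []), ("r2", []), ("r3", ["r1", "r2"]),
         ("r4", []), ("r5", ["r3", "r4"])]
print(ideal_ilp(trace))  # 5 instructions / 3-cycle critical path = 1.67
```

A real limit study would additionally track memory dependences and apply windowing to model finite instruction windows, which is what shrinks the high mean ILP down to the much smaller ILP available over most of the execution.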