The Cilkview scalability analyzer

Authors:
Yuxiong He;Charles E. Leiserson;William M. Leiserson
Affiliations:
Microsoft Research, Redmond, WA, USA;MIT, Cambridge, MA, USA;Intel, Nashua, NH, USA
Venue:
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Year:
2010

Citing 34
Cited 11

Speedup Versus Efficiency in Parallel Systems

IEEE Transactions on Computers
Quartz: a tool for tuning parallel program performance

SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Performance debugging shared memory multiprocessor programs with MTOOL

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Space-Efficient Scheduling of Multithreaded Computations

SIAM Journal on Computing
The implementation of the Cilk-5 multithreaded language

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
The Parallel Evaluation of General Arithmetic Expressions

Journal of the ACM (JACM)
Scheduling multithreaded computations by work stealing

Journal of the ACM (JACM)
A Java fork/join framework

Proceedings of the ACM 2000 conference on Java Grande
From trace generation to visualization: a performance framework for distributed parallel systems

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Statistical scalability analysis of communication operations in distributed applications

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Dynamic statistical profiling of communication activity in distributed applications

SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Design and Prototype of a Performance Tool Interface for OpenMP

The Journal of Supercomputing
The Paradyn Parallel Performance Measurement Tool

Computer
Protocol Verification as a Hardware Design Aid

ICCD '92 Proceedings of the 1991 IEEE International Conference on Computer Design on VLSI in Computer & Processors
The Dynamic Probe Class Library: An Infrastucture for Developing Instrumentation for Performance Tools

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Cache-Oblivious Algorithms

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Pin: building customized program analysis tools with dynamic instrumentation

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
An API for Runtime Code Patching

International Journal of High Performance Computing Applications
Efficient, transparent, and comprehensive runtime code manipulation

Efficient, transparent, and comprehensive runtime code manipulation
Cache oblivious stencil computations

Proceedings of the 19th annual international conference on Supercomputing
X10: an object-oriented approach to non-uniform cluster computing

OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
The cache complexity of multithreaded cache oblivious algorithms

Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
Intel threading building blocks

Intel threading building blocks
Effective performance measurement and analysis of multithreaded applications

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Binary analysis for measurement and attribution of program performance

Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Reducers and other Cilk++ hyperobjects

Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
Introduction to Algorithms, Third Edition

Introduction to Algorithms, Third Edition
The design of a task parallel library

Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
On the partial difference equations of mathematical physics

IBM Journal of Research and Development
Balanced Dense Polynomial Multiplication on Multi-Cores

PDCAT '09 Proceedings of the 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies
The Cilk++ concurrency platform

The Journal of Supercomputing
A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers)

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
A scalable approach to MPI application performance analysis

PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
FFT-based dense polynomial arithmetic on multi-cores

HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications

Exploiting multicore systems with Cilk

Proceedings of the 4th International Workshop on Parallel and Symbolic Computation
Parallel computation of the minimal elements of a poset

Proceedings of the 4th International Workshop on Parallel and Symbolic Computation
Parallelism and data movement characterization of contemporary application classes

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
The pochoir stencil compiler

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Kremlin: rethinking and rebooting gprof for the multicore age

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Parkour: parallel speedup estimates for serial programs

HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Kismet: parallel speedup estimates for serial programs

Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
DAG3: a tool for design and analysis of applications for multicore architectures

Proceedings of the 27th Annual ACM Symposium on Applied Computing
Harmony: collection and analysis of parallel block vectors

Proceedings of the 39th Annual International Symposium on Computer Architecture
Seeing the futures: profiling shared-memory parallel racket

Proceedings of the 1st ACM SIGPLAN workshop on Functional high-performance computing
On-the-fly pipeline parallelism

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures

Quantified Score

Hi-index	0.00

Visualization

Abstract

The Cilkview scalability analyzer is a software tool for profiling, estimating scalability, and benchmarking multithreaded Cilk++ applications. Cilkview monitors logical parallelism during an instrumented execution of the Cilk++ application on a single processing core. As Cilkview executes, it analyzes logical dependencies within the computation to determine its work and span (critical-path length). These metrics allow Cilkview to estimate parallelism and predict how the application will scale with the number of processing cores. In addition, Cilkview analyzes cheduling overhead using the concept of a "burdened dag," which allows it to diagnose performance problems in the application due to an insufficient grain size of parallel subcomputations. Cilkview employs the Pin dynamic-instrumentation framework to collect metrics during a serial execution of the application code. It operates directly on the optimized code rather than on a debug version. Metadata embedded by the Cilk++ compiler in the binary executable identifies the parallel control constructs in the executing application. This approach introduces little or no overhead to the program binary in normal runs. Cilkview can perform real-time scalability benchmarking automatically, producing gnuplot-compatible output that allows developers to compare an application's performance with the tool's predictions. If the program performs beneath the range of expectation, the programmer can be confident in seeking a cause such as insufficient memory bandwidth, false sharing, or contention rather than inadequate parallelism or insufficient grain size.