Speedup Versus Efficiency in Parallel Systems
IEEE Transactions on Computers
Quartz: a tool for tuning parallel program performance
SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Performance debugging shared memory multiprocessor programs with MTOOL
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Space-Efficient Scheduling of Multithreaded Computations
SIAM Journal on Computing
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
The Parallel Evaluation of General Arithmetic Expressions
Journal of the ACM (JACM)
Scheduling multithreaded computations by work stealing
Journal of the ACM (JACM)
Proceedings of the ACM 2000 conference on Java Grande
From trace generation to visualization: a performance framework for distributed parallel systems
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Statistical scalability analysis of communication operations in distributed applications
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Dynamic statistical profiling of communication activity in distributed applications
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Design and Prototype of a Performance Tool Interface for OpenMP
The Journal of Supercomputing
Protocol Verification as a Hardware Design Aid
ICCD '92 Proceedings of the 1991 IEEE International Conference on Computer Design on VLSI in Computer & Processors
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
An API for Runtime Code Patching
International Journal of High Performance Computing Applications
Efficient, transparent, and comprehensive runtime code manipulation
Efficient, transparent, and comprehensive runtime code manipulation
Cache oblivious stencil computations
Proceedings of the 19th annual international conference on Supercomputing
X10: an object-oriented approach to non-uniform cluster computing
OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
The cache complexity of multithreaded cache oblivious algorithms
Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
Intel threading building blocks
Intel threading building blocks
Effective performance measurement and analysis of multithreaded applications
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Binary analysis for measurement and attribution of program performance
Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Reducers and other Cilk++ hyperobjects
Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
Introduction to Algorithms, Third Edition
Introduction to Algorithms, Third Edition
The design of a task parallel library
Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
On the partial difference equations of mathematical physics
IBM Journal of Research and Development
Balanced Dense Polynomial Multiplication on Multi-Cores
PDCAT '09 Proceedings of the 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies
The Cilk++ concurrency platform
The Journal of Supercomputing
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
A scalable approach to MPI application performance analysis
PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
FFT-based dense polynomial arithmetic on multi-cores
HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications
Exploiting multicore systems with Cilk
Proceedings of the 4th International Workshop on Parallel and Symbolic Computation
Parallel computation of the minimal elements of a poset
Proceedings of the 4th International Workshop on Parallel and Symbolic Computation
Parallelism and data movement characterization of contemporary application classes
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Kremlin: rethinking and rebooting gprof for the multicore age
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Parkour: parallel speedup estimates for serial programs
HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Kismet: parallel speedup estimates for serial programs
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
DAG3: a tool for design and analysis of applications for multicore architectures
Proceedings of the 27th Annual ACM Symposium on Applied Computing
Harmony: collection and analysis of parallel block vectors
Proceedings of the 39th Annual International Symposium on Computer Architecture
Seeing the futures: profiling shared-memory parallel racket
Proceedings of the 1st ACM SIGPLAN workshop on Functional high-performance computing
On-the-fly pipeline parallelism
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Hi-index | 0.00 |
The Cilkview scalability analyzer is a software tool for profiling, estimating scalability, and benchmarking multithreaded Cilk++ applications. Cilkview monitors logical parallelism during an instrumented execution of the Cilk++ application on a single processing core. As Cilkview executes, it analyzes logical dependencies within the computation to determine its work and span (critical-path length). These metrics allow Cilkview to estimate parallelism and predict how the application will scale with the number of processing cores. In addition, Cilkview analyzes cheduling overhead using the concept of a "burdened dag," which allows it to diagnose performance problems in the application due to an insufficient grain size of parallel subcomputations. Cilkview employs the Pin dynamic-instrumentation framework to collect metrics during a serial execution of the application code. It operates directly on the optimized code rather than on a debug version. Metadata embedded by the Cilk++ compiler in the binary executable identifies the parallel control constructs in the executing application. This approach introduces little or no overhead to the program binary in normal runs. Cilkview can perform real-time scalability benchmarking automatically, producing gnuplot-compatible output that allows developers to compare an application's performance with the tool's predictions. If the program performs beneath the range of expectation, the programmer can be confident in seeking a cause such as insufficient memory bandwidth, false sharing, or contention rather than inadequate parallelism or insufficient grain size.