An Analytical Method for Predicting the Performance of Parallel Image Processing Operations
The Journal of Supercomputing
Parallel performance prediction using lost cycles analysis
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Integrating Automatic Techniques in a Performance Analysis Session (Research Note)
Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Relating the Execution Behaviour with the Structure of the Application
Proceedings of the 6th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Analysis of input-dependent program behavior using active profiling
Proceedings of the 2007 workshop on Experimental computer science
Scientific Programming - Parallel Computing Projects of the Swiss Priority Programme
Traditional performance debugging and tuning of parallel programs is based on the "measure-modify" approach, in which detailed measurements of program executions are used to guide incremental changes to the program that result in better performance. Unfortunately, the performance of a parallel algorithm is often related to its implementation, input data, and machine characteristics in surprising ways, and the "measure-modify" approach is unsuited to exploring these relationships fully: it is too heavily dependent on experimentation and measurement, which is impractical for studying the large number of variables that can affect parallel program performance. In this paper we argue that the problem of selecting the best implementation of a parallel algorithm requires a new approach to parallel program performance evaluation, one with a greater balance between measurement and modeling. We first present examples that demonstrate that different parallelizations of a program may be necessary to achieve the best possible performance as one varies the input data, machine architecture, or number of processors used. We then present an approach to performance evaluation based on lost cycles analysis, which involves measurement and modeling of all sources of overhead in a parallel program. We describe a measurement tool for lost cycles analysis that we have incorporated into the runtime environment for Fortran programs on the Kendall Square KSR1, and use this tool to analyze the performance tradeoffs among implementations of 2D FFT and parallel subgraph isomorphism. Using these examples, we show how lost cycles analysis can be used to solve the problems associated with selecting the best implementation in a variable environment. 
In addition, we show that this approach can capture large amounts of performance data using only a small number of measurements, and that it is flexible enough to allow conclusions to be drawn from empirical data in some cases, and analytic results in other cases.
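
The bookkeeping behind this idea can be illustrated with a minimal sketch. It is not the authors' tool (which the abstract describes as part of the Fortran runtime environment on the KSR1). It assumes that lost cycles are the processor-time not spent on useful computation, p * T_p - T_1, and that each implementation's total overhead can be modeled as a function of the processor count. The function names, the two implementation names, and the overhead formulas below are hypothetical and chosen only for illustration.

```python
def lost_cycles(processors, parallel_time, sequential_time):
    """Processor-time not spent on useful computation: p * T_p - T_1."""
    return processors * parallel_time - sequential_time


def predicted_time(sequential_time, total_overhead, processors):
    """Predicted running time when useful work plus overhead is spread over p processors."""
    return (sequential_time + total_overhead) / processors


def best_implementation(overhead_models, sequential_time, processors):
    """Return the name of the implementation with the lowest predicted time.

    overhead_models maps an implementation name to a function that returns
    the modeled total overhead (in processor-seconds) for a processor count.
    """
    return min(
        overhead_models,
        key=lambda name: predicted_time(
            sequential_time, overhead_models[name](processors), processors
        ),
    )


# Hypothetical overhead models, invented for this sketch (not measured data).
models = {
    "row_decomp": lambda p: 2.0 * p,           # overhead grows quickly with p
    "block_decomp": lambda p: 10.0 + 0.5 * p,  # higher fixed cost, slower growth
}

if __name__ == "__main__":
    for p in (2, 16):
        print(p, best_implementation(models, 100.0, p))
```

Under these assumed models the preferred implementation changes as the processor count grows, which is the kind of tradeoff the abstract says a balance of measurement and modeling is meant to expose.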