A methodology for detailed performance modeling of reduction computations on SMP machines

  • Authors:
  • Ruoming Jin; Gagan Agrawal

  • Affiliations:
  • Department of Computer and Information Sciences, Ohio State University, Columbus, OH 43210, USA (both authors)

  • Venue:
  • Performance Evaluation - Performance modelling and evaluation of high-performance parallel and distributed systems
  • Year:
  • 2005

Abstract

In this paper, we revisit the problem of performance prediction on SMP machines, motivated by the need to select a parallelization strategy for random write reductions. Such reductions arise frequently in data mining algorithms. In our previous work, we developed a number of techniques for parallelizing this class of reductions and showed that each of three techniques, full replication, optimized full locking, and cache-sensitive locking, can outperform the others depending upon problem, dataset, and machine parameters. An important question, therefore, is: "Can we predict the performance of these techniques for a given problem, dataset, and machine?" This paper addresses that question by developing an analytical performance model that captures a two-level cache, coherence cache misses, TLB misses, locking overheads, and contention for memory. The analytical model is combined with results from micro-benchmarking to predict performance on real machines. We have validated our model on two different SMP machines. Our results show that the model effectively captures the impact of the memory hierarchy (two-level cache and TLB) as well as the factors that limit parallelism (contention for locks, memory contention, and coherence cache misses). The difference between predicted and measured performance is within 20% in almost all cases. Moreover, the model is quite accurate in predicting the relative performance of the three parallelization techniques.
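
To make the contrast among these strategies concrete, the following is a minimal sketch of a random write reduction (a histogram update) parallelized with full replication versus per-element locking, written in C++ with standard threads. It illustrates the general techniques rather than the authors' implementation; the bin count, thread count, dataset size, and index function are all assumptions chosen for the example.

    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    // All sizes below are assumptions chosen for illustration.
    constexpr int kBins    = 256;       // size of the reduction object
    constexpr int kThreads = 4;         // number of SMP threads
    constexpr int kItems   = 1 << 20;   // number of input records

    // Stand-in for the data-dependent (random) write index.
    int bin_of(int i) { return static_cast<int>((i * 2654435761u) % kBins); }

    // Full replication: each thread updates a private copy of the
    // reduction object; the copies are merged sequentially at the end.
    std::vector<long> reduce_replicated() {
        std::vector<std::vector<long>> priv(kThreads, std::vector<long>(kBins, 0));
        std::vector<std::thread> ts;
        for (int t = 0; t < kThreads; ++t)
            ts.emplace_back([&priv, t] {
                for (int i = t; i < kItems; i += kThreads)
                    ++priv[t][bin_of(i)];        // no synchronization needed
            });
        for (auto& th : ts) th.join();
        std::vector<long> out(kBins, 0);         // sequential merge phase
        for (const auto& p : priv)
            for (int b = 0; b < kBins; ++b) out[b] += p[b];
        return out;
    }

    // Full locking: one shared copy with one lock per element; every
    // update acquires the lock guarding its target element.
    std::vector<long> reduce_locked() {
        std::vector<long> shared(kBins, 0);
        std::vector<std::mutex> locks(kBins);
        std::vector<std::thread> ts;
        for (int t = 0; t < kThreads; ++t)
            ts.emplace_back([&shared, &locks, t] {
                for (int i = t; i < kItems; i += kThreads) {
                    int b = bin_of(i);
                    std::lock_guard<std::mutex> g(locks[b]);
                    ++shared[b];
                }
            });
        for (auto& th : ts) th.join();
        return shared;
    }

    int main() {
        auto a = reduce_replicated();
        auto b = reduce_locked();
        std::printf("bin 0: replicated=%ld, locked=%ld\n", a[0], b[0]);
        return 0;
    }

With replication, threads never synchronize during the scan but pay for extra memory and a sequential merge; with locking, a single copy is shared but every update acquires a lock and can incur coherence cache misses when threads touch the same cache lines. The paper's optimized full locking and cache-sensitive schemes refine the locked layout (for instance, by placing locks adjacent to the elements they guard and packing elements and locks with cache-line granularity) precisely to reduce the locking overheads and coherence misses the model captures. For either scheme, a model of the kind described would predict execution time as roughly T ≈ T_compute + Σ_level N_miss(level) · c_miss(level) + T_lock + T_contention, with the per-miss costs c_miss obtained from micro-benchmarks; this decomposition is illustrative, not the paper's exact formulation.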