ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
ACM Transactions on Computer Systems (TOCS)
LogP: towards a realistic model of parallel computation
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Analyzing multiprocessor cache behavior through data reference modeling
SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
An analytical model of high performance superscalar-based multiprocessors
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Analysis of benchmark characteristics and benchmark performance prediction
ACM Transactions on Computer Systems (TOCS)
Parallelized Direct Execution Simulation of Message-Passing Parallel Programs
IEEE Transactions on Parallel and Distributed Systems
LoPC: modeling contention in parallel algorithms
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Analytic evaluation of shared-memory systems with ILP processors
Proceedings of the 25th annual international symposium on Computer architecture
Computer architecture (2nd ed.): a quantitative approach
Computer architecture (2nd ed.): a quantitative approach
Predictive analysis of a wavefront application using LogGP
Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Cache miss equations: a compiler framework for analyzing and tuning memory behavior
ACM Transactions on Programming Languages and Systems (TOPLAS)
Efficient performance prediction for modern microprocessors
Proceedings of the 2000 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
An analytical model of the working-set sizes in decision-support systems
Proceedings of the 2000 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Exact analysis of the cache behavior of nested loops
Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Measuring memory hierarchy performance of cache-coherent multiprocessors using micro benchmarks
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Measuring Cache and TLB Performance and Their Effect on Benchmark Runtimes
IEEE Transactions on Computers
Microbenchmarking and Performance Prediction for Parallel
Microbenchmarking and Performance Prediction for Parallel
Theory, Volume 1, Queueing Systems
Theory, Volume 1, Queueing Systems
lmbench: portable tools for performance analysis
ATEC '96 Proceedings of the 1996 annual conference on USENIX Annual Technical Conference
IEEE Transactions on Knowledge and Data Engineering
Using Information from Prior Runs to Improve Automated Tuning Systems
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Database system performance evaluation models: A survey
Performance Evaluation
Hi-index | 0.00 |
In this paper, we revisit the problem of performance prediction on shared memory parallel machines, motivated by the need for selecting parallelization strategy for random write reductions. Such reductions frequently arise in data mining algorithms.In our previous work, we have developed a number of techniques for parallelizing this class of reductions. Our previous work has shown that each of the three techniques, full replication, optimized full locking, and cache-sensitive, can outperform others depending upon problem, dataset, and machine parameters. Therefore, an important question is, "Can we predict the performance of these techniques for a given problem, dataset, and machine?".This paper addresses this question by developing an analytical performance model that captures a two-level cache, coherence cache misses, TLB misses, locking overheads, and contention for memory. Analytical model is combined with results from micro-benchmarking to predict performance on real machines. We have validated our model on two different SMP machines. Our results show that our model effectively captures the impact of memory hierarchy (two-level cache and TLB) as well as the factors that limit parallelism (contention for locks, memory contention, and coherence cache misses). The difference between predicted and measured performance is within 20% in almost all cases. Moreover, the model is quite accurate in predicting the relative performance of the three parallelization techniques.