In this paper, we describe a tool we have developed called a code isolator. We envision that such a tool will facilitate many software development activities in complex software systems, but we are using it to isolate code segments from large scientific and engineering codes for the purpose of performance tuning. The goal of the code isolator is to provide an executable version of a code segment, together with representative data, that mimics the performance of the code in the full program. The resulting isolated code can be used in performance tuning experiments, requiring only a small fraction of the time needed to execute the code within the full program. We describe the analyses and transformations used by the code isolator, which we have largely automated in the SUIF compiler. We present a case study of its use with LS-DYNA, a large, widely used engineering application, and demonstrate how the tool derives code that permits performance tuning for the cache. We present results comparing L1 cache misses and execution time for the original program and the isolated program generated by the tool with some manual intervention. We find that the isolated code can be executed 3600 times faster than the original program, and most of the L1 cache misses are preserved. We identify areas where additional analyses can close the remaining gap in predicting and preserving cache misses in the isolated code.
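To make the idea concrete, the following is a minimal sketch of what an isolated code segment might look like; it is not output of the actual tool described in the paper. It assumes a hypothetical outlined loop nest (isolated_kernel), a hypothetical checkpoint file ("isolate.dat") holding the segment's live-in data captured from an instrumented run of the full application, and an illustrative problem size.

```c
/*
 * Sketch of the code-isolation idea (illustrative only, not SUIF output).
 * The segment of interest is outlined into its own function; a standalone
 * driver restores the segment's captured live-in state and re-executes
 * only that segment, so it can be timed and tuned without running the
 * whole application.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512   /* assumed problem size captured from the original run */

/* Outlined code segment: a simple stand-in loop nest. */
static void isolated_kernel(double *a, const double *b, const double *c)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i * N + j] = b[i * N + j] + 0.5 * c[i * N + j];
}

/* Restore the segment's live-in state from a checkpoint written by the
 * instrumented full application (hypothetical file "isolate.dat"). */
static int load_state(double *b, double *c)
{
    FILE *f = fopen("isolate.dat", "rb");
    if (!f)
        return -1;
    size_t ok = fread(b, sizeof(double), (size_t)N * N, f)
              + fread(c, sizeof(double), (size_t)N * N, f);
    fclose(f);
    return ok == 2u * N * N ? 0 : -1;
}

int main(void)
{
    double *a = malloc(sizeof(double) * N * N);
    double *b = malloc(sizeof(double) * N * N);
    double *c = malloc(sizeof(double) * N * N);
    if (!a || !b || !c)
        return 1;

    if (load_state(b, c) != 0) {
        /* Fall back to synthetic data if no checkpoint is available. */
        for (int i = 0; i < N * N; i++) { b[i] = i; c[i] = N - i; }
    }

    clock_t t0 = clock();
    isolated_kernel(a, b, c);          /* the tuning experiment */
    clock_t t1 = clock();

    printf("isolated segment: %.3f ms\n",
           1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC);
    free(a); free(b); free(c);
    return 0;
}
```

In this sketch, the driver, kernel body, and checkpoint format are placeholders; the point is only that the isolated segment can be rebuilt, rerun, and measured (e.g., for cache misses) in seconds rather than re-executing the entire application.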