Tolerating latency through software-controlled prefetching in shared-memory multiprocessors
Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
MOB forms: a class of multilevel block algorithms for dense linear algebra operations
ICS '94 Proceedings of the 8th international conference on Supercomputing
Compiler transformations for high-performance computing
ACM Computing Surveys (CSUR)
The KAP parallelizer for DEC Fortran and DEC C programs
Digital Technical Journal
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
A quantitative analysis of loop nest locality
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97 Proceedings of the 11th international conference on Supercomputing
Unroll-and-jam using uniformly generated sets
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Load latency tolerance in dynamically scheduled processors
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
ACM Computing Surveys (CSUR)
Exact analysis of the cache behavior of nested loops
Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Accuracy and Speedup of Parallel Trace-Driven Architectural Simulation
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Hierarchical tiling for improved superscalar performance
IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Quantifying the Multi-level Nature of Tiling Interactions
LCPC '97 Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing
A Comparison of Compiler Tiling Algorithms
CC '99 Proceedings of the 8th International Conference on Compiler Construction, Held as Part of the European Joint Conferences on the Theory and Practice of Software, ETAPS'99
Memory Hierarchy Considerations for Fast Transpose and Bit-Reversals
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
Combining Optimization for Cache and Instruction-Level Parallelism
PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Applications of storage mapping optimization to register promotion
Proceedings of the 18th annual international conference on Supercomputing
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Statistical Models for Empirical Search-Based Performance Tuning
International Journal of High Performance Computing Applications
Facilitating the search for compositions of program transformations
Proceedings of the 19th annual international conference on Supercomputing
Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies
International Journal of Parallel Programming
Systematic search within an optimisation space based on Unified Transformation Framework
International Journal of Computational Science and Engineering
Iterative collective loop fusion
CC'06 Proceedings of the 15th international conference on Compiler Construction
Hi-index | 0.00 |
As the complexity of processor architectures increases, there is a widening gap between peak processor performance and sustained processor performance so that programs now tend to exploit only a fraction of available performance. While there is a tremendous amount of literature on program optimizations, compiler optimizations lack efficiency because they are plagued by three flaws: (1) they often implicitly use simplified, if not simplistic, models of processor architecture, (2) they usually focus on a single processor component (e.g., cache) and ignore the interactions among multiple components, (3) the most heavily investigated components (e.g., caches) sometimes have only a small impact on overall performance. Through the in-depth analysis of a simple program kernel, we want to show that understanding the complex interactions between programs and the numerous processor architecture components is both feasible and critical to design efficient program optimizations.