Estimating interlock and improving balance for pipelined architectures
Journal of Parallel and Distributed Computing
Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Improving register allocation for subscripted variables
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Interprocedural transformations for parallel code generation
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Optimizing for parallelism and data locality
ICS '92 Proceedings of the 6th international conference on Supercomputing
Access normalization: loop restructuring for NUMA compilers
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Automatic and interactive parallelization
Automatic and interactive parallelization
Improving locality and parallelism in nested loops
Improving locality and parallelism in nested loops
Scalar replacement in the presence of conditional control flow
Software—Practice & Experience
Memory-hierarchy management
Dependence graphs and compiler optimizations
POPL '81 Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
On Estimating and Enhancing Cache Effectiveness
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution
Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Iteration Space Tiling for Memory Hierarchies
Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
A hierarchical basis for reordering transformations
POPL '84 Proceedings of the 11th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
Improving the performance of virtual memory computers.
Improving the performance of virtual memory computers.
Unifying data and control transformations for distributed shared-memory machines
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Data and computation transformations for multiprocessors
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Abstract interpretation and low-level code optimization
PEPM '95 Proceedings of the 1995 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation
Compiler cache optimizations for banded matrix problems
ICS '95 Proceedings of the 9th international conference on Supercomputing
Compiler techniques for data prefetching on the PowerPC
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
SPAID: software prefetching in pointer- and call-intensive environments
Proceedings of the 28th annual international symposium on Microarchitecture
A compiler optimization to reduce execution time of loop nest
ACM SIGARCH Computer Architecture News
The influence of caches on the performance of heaps
Journal of Experimental Algorithmics (JEA)
A quantitative analysis of loop nest locality
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Value locality and load value prediction
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Compiler-based prefetching for recursive data structures
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Compiler-directed page coloring for multiprocessors
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Examination of a memory access classification scheme for pointer-intensive and numeric programs
ICS '96 Proceedings of the 10th international conference on Supercomputing
Combining loop transformations considering caches and scheduling
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Cache miss equations: an analytical representation of cache misses
ICS '97 Proceedings of the 11th international conference on Supercomputing
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Tuning compiler optimizations for simultaneous multithreading
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Escape analysis: correctness proof, implementation and experimental results
POPL '98 Proceedings of the 25th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The implementation and evaluation of fusion and contraction in array languages
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
A general algorithm for tiling the register level
ICS '98 Proceedings of the 12th international conference on Supercomputing
Using generational garbage collection to implement cache-conscious data placement
Proceedings of the 1st international symposium on Memory management
Improving locality using loop and data transformations in an integrated framework
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Schedule-independent storage mapping for loops
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Cache-conscious data placement
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Cache-conscious structure layout
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Load-reuse analysis: design and evaluation
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
A locality sensitive multi-module cache with explicit management
ICS '99 Proceedings of the 13th international conference on Supercomputing
Reducing cache misses using hardware and software page placement
ICS '99 Proceedings of the 13th international conference on Supercomputing
Nonlinear array layouts for hierarchical memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Escape analysis for object-oriented languages: application to Java
Proceedings of the 14th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Procedure placement using temporal-ordering information
ACM Transactions on Programming Languages and Systems (TOPLAS)
Tuning Compiler Optimizations for Simultaneous Multithreading
International Journal of Parallel Programming - Special issue on the 30th annual ACM/IEEE international symposium on microarchitecture, part II
Data Locality Exploitation in the Decomposition of Regular Domain Problems
IEEE Transactions on Parallel and Distributed Systems
Improving fine-grained irregular shared-memory benchmarks by data reordering
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Exact analysis of the cache behavior of nested loops
Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Blocking and array contraction across arbitrarily nested loops using affine partitioning
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
An efficient profile-analysis framework for data-layout optimizations
POPL '02 Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Efficient Representation Scheme for Multidimensional Array Operations
IEEE Transactions on Computers
Compiling stencils in high performance Fortran
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Page replacement using marginal loss functions
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Computation regrouping: restructuring programs for temporal data cache locality
ICS '02 Proceedings of the 16th international conference on Supercomputing
Register tiling in nonrectangular iteration spaces
ACM Transactions on Programming Languages and Systems (TOPLAS)
Combining Loop Transformations Considering Caches and Scheduling
International Journal of Parallel Programming
Quantifying the Multi-Level Nature of Tiling Interactions
International Journal of Parallel Programming
Recursive Array Layouts and Fast Matrix Multiplication
IEEE Transactions on Parallel and Distributed Systems
Probabilistic Miss Equations: Evaluating Memory Hierarchy Performance
IEEE Transactions on Computers
Improving Effective Bandwidth through Compiler Enhancement of Global Cache Reuse
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Memory System Support for Dynamic Cache Line Assembly
IMS '00 Revised Papers from the Second International Workshop on Intelligent Memory Systems
On the Parallel Execution Time of Tiled Loops
IEEE Transactions on Parallel and Distributed Systems
Compile-time composition of run-time data and iteration reorderings
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Strategies for Improving Data Locality in Embedded Applications
ASP-DAC '02 Proceedings of the 2002 Asia and South Pacific Design Automation Conference
Guided region prefetching: a cooperative hardware/software approach
Proceedings of the 30th annual international symposium on Computer architecture
IEEE Transactions on Parallel and Distributed Systems
Compiler parallelization of C programs for multi-core DSPs with multiple address spaces
Proceedings of the 1st IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Escape analysis for JavaTM: Theory and practice
ACM Transactions on Programming Languages and Systems (TOPLAS)
Automatic code generation for a convection scheme
Proceedings of the 2003 ACM symposium on Applied computing
Exploiting Processor Workload Heterogeneity for Reducing Energy Consumption in Chip Multiprocessors
Proceedings of the conference on Design, automation and test in Europe - Volume 2
A fast and accurate framework to analyze and optimize cache memory behavior
ACM Transactions on Programming Languages and Systems (TOPLAS)
Restructuring computations for temporal data cache locality
International Journal of Parallel Programming
The Energy Impact of Aggressive Loop Fusion
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
A Complete Compiler Approach to Auto-Parallelizing C Programs for Multi-DSP Systems
IEEE Transactions on Parallel and Distributed Systems
Locality-Aware Process Scheduling for Embedded MPSoCs
Proceedings of the conference on Design, Automation and Test in Europe - Volume 2
Dynamic memory interval test vs. interprocedural pointer analysis in multimedia applications
ACM Transactions on Architecture and Code Optimization (TACO)
An accurate cost model for guiding data locality transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Optimizing irregular shared-memory applications for distributed-memory systems
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Code restructuring for improving cache performance of MPSoCs
ICCAD '05 Proceedings of the 2005 IEEE/ACM International conference on Computer-aided design
Single-dimension software pipelining for multidimensional loops
ACM Transactions on Architecture and Code Optimization (TACO)
Trade-offs in loop transformations
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Optimizing shared cache behavior of chip multiprocessors
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Composition-based Cache simulation for structure reorganization
Journal of Systems Architecture: the EUROMICRO Journal
Model-guided empirical tuning of loop fusion
International Journal of High Performance Systems Architecture
Multi-port abstraction layer for FPGA intensive memory exploitation applications
Journal of Systems Architecture: the EUROMICRO Journal
Engineering scalable, cache and space efficient tries for strings
The VLDB Journal — The International Journal on Very Large Data Bases
Improving cache locality for thread-level speculation
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Data locality and parallelism optimization using a constraint-based approach
Journal of Parallel and Distributed Computing
Redesigning the string hash table, burst trie, and BST to exploit cache
Journal of Experimental Algorithmics (JEA)
Reducing Network-on-Chip energy consumption through spatial locality speculation
NOCS '11 Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip
A framework for compiler driven design space exploration for embedded system customization
ASIAN'04 Proceedings of the 9th Asian Computing Science conference on Advances in Computer Science: dedicated to Jean-Louis Lassez on the Occasion of His 5th Cycle Birthday
MiniTasking: improving cache performance for multiple query workloads
WAIM '06 Proceedings of the 7th international conference on Advances in Web-Age Information Management
Optimization of dense matrix multiplication on IBM cyclops-64: challenges and experiences
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
A comparative analysis of performance improvement schemes for cache memories
Computers and Electrical Engineering
Optimizing I/O for big array analytics
Proceedings of the VLDB Endowment
Parameterized micro-benchmarking: an auto-tuning approach for complex applications
Proceedings of the 9th conference on Computing Frontiers
Hi-index | 0.01 |
In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs.To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve because benchmark programs typically have high hit rates even for small data caches; however, our optimizations significantly improved several programs.