Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Some efficient solutions to the affine scheduling problem: I. One-dimensional time
International Journal of Parallel Programming
Compiling for numa parallel machines
Compiling for numa parallel machines
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Transformations of nested loops with non-convex iteration spaces
Parallel Computing
Data-centric multi-level blocking
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Improving Cache Locality by a Combination of Loop and Data Transformations
IEEE Transactions on Computers - Special issue on cache memory and related problems
LAPACK Users' guide (third ed.)
LAPACK Users' guide (third ed.)
Generation of Efficient Nested Loops from Polyhedra
International Journal of Parallel Programming - Special issue on instruction-level parallelism and parallelizing compilation, part 2
Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design
Precise Data Locality Optimization of Nested Loops
The Journal of Supercomputing
Structure of Computers and Computations
Structure of Computers and Computations
Iteration Space Tiling for Memory Hierarchies
Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
Applications of storage mapping optimization to register promotion
Proceedings of the 18th annual international conference on Supercomputing
Facilitating the search for compositions of program transformations
Proceedings of the 19th annual international conference on Supercomputing
Intermediately executed code is the key to find refactorings that improve temporal data locality
Proceedings of the 3rd conference on Computing frontiers
Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies
International Journal of Parallel Programming
Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time
Proceedings of the International Symposium on Code Generation and Optimization
Iterative optimization in the polyhedral model: part ii, multidimensional time
Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
A practical automatic polyhedral parallelizer and locality optimizer
Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Finding and Applying Loop Transformations for Generating Optimized FPGA Implementations
Transactions on High-Performance Embedded Architectures and Compilers I
Precise Management of Scratchpad Memories for Localising Array Accesses in Scientific Codes
CC '09 Proceedings of the 18th International Conference on Compiler Construction: Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009
Efficient code generation for automatic parallelization and optimization
ISPDC'03 Proceedings of the Second international conference on Parallel and distributed computing
Polyhedral code generation in the real world
CC'06 Proceedings of the 15th international conference on Compiler Construction
Optimizing I/O for big array analytics
Proceedings of the VLDB Endowment
A MapReduce-supported network structure for data centers
Concurrency and Computation: Practice & Experience
Hi-index | 0.00 |
Cache memories were invented to decouple fast processors from slow memories. However, this decoupling is only partial, and many researchers have attempted to improve cache use by program optimization. Potential benefits are significant since both energy dissipation and performance highly depend on the traffic between memory levels. But modeling the traffic is difficult; this observation has led to the use of heuristic methods for steering program transformations. In this paper, we propose another approach: we simplify the cache model and we organize the target program in such a way that an asymptotic evaluation of the memory traffic is possible. This information is used by our optimization algorithm in order to find the best reordering of the program operations, at least in an asymptotic sense. Our method optimizes both temporal and spatial locality. It can be applied to any static control program with arbitrary dependences. The optimizer has been partially implemented and applied to non-trivial programs. We present experimental evidence that the amount of cache misses is drastically reduced with corresponding performance improvements.