Evaluating Associativity in CPU Caches
IEEE Transactions on Computers
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Data cache performance of supercomputer applications
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Counting solutions to Presburger formulas: how and why
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
A quantitative analysis of loop nest locality
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
ICS '96 Proceedings of the 10th international conference on Supercomputing
Combining loop transformations considering caches and scheduling
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Cache miss equations: an analytical representation of cache misses
ICS '97 Proceedings of the 11th international conference on Supercomputing
Unroll-and-jam using uniformly generated sets
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Data transformations for eliminating conflict misses
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Eliminating conflict misses for high performance architectures
ICS '98 Proceedings of the 12th international conference on Supercomputing
Computer architecture (2nd ed.): a quantitative approach
Computer architecture (2nd ed.): a quantitative approach
Augmenting Loop Tiling with Data Alignment for Improved Cache Performance
IEEE Transactions on Computers - Special issue on cache memory and related problems
New tiling techniques to improve cache temporal locality
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
An experimental evaluation of tiling and shackling for memory hierarchy management
ICS '99 Proceedings of the 13th international conference on Supercomputing
A tile selection algorithm for data locality and cache interference
ICS '99 Proceedings of the 13th international conference on Supercomputing
Analytical Modeling of Set-Associative Cache Behavior
IEEE Transactions on Computers
Cache miss equations: a compiler framework for analyzing and tuning memory behavior
ACM Transactions on Programming Languages and Systems (TOPLAS)
Locality optimizations for multi-level caches
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
Quantifying the Multi-Level Nature of Tiling Interactions
International Journal of Parallel Programming
On Estimating and Enhancing Cache Effectiveness
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Iteration Space Tiling for Memory Hierarchies
Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
A Comparison of Compiler Tiling Algorithms
CC '99 Proceedings of the 8th International Conference on Compiler Construction, Held as Part of the European Joint Conferences on the Theory and Practice of Software, ETAPS'99
A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness
CASCON '94 Proceedings of the 1994 conference of the Centre for Advanced Studies on Collaborative research
Software Methods for Avoiding Cache Conflicts
Software Methods for Avoiding Cache Conflicts
Predicting the Cache Miss Ratio of Loop-Nested Array References
Predicting the Cache Miss Ratio of Loop-Nested Array References
Unfavorable Strides in Cache Memory Systems (RNR Technical Report RNR-92-015)
Scientific Programming
A New Genetic Algorithm for Loop Tiling
The Journal of Supercomputing
Positivity, posynomials and tile size selection
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Automatic creation of tile size selection models
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Practical loop transformations for tensor contraction expressions on multi-level memory hierarchies
CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Tuning blocked array layouts to exploit memory hierarchy in SMT architectures
PCI'05 Proceedings of the 10th Panhellenic conference on Advances in Informatics
Analytical bounds for optimal tile size selection
CC'12 Proceedings of the 21st international conference on Compiler Construction
Automatic OpenCL work-group size selection for multicore CPUs
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.01 |
Loop tiling is an effective optimizing transformation to boost the memory performance of a program, especially for dense matrix scientific computations. The magnitude and stability of the achieved performance improvements are heavily dependent on the appropriate selection of tile sizes. Many existing tile selection algorithms try to find tile sizes which eliminate self-interference cache conflict misses, maximize cache utilization, and minimize cross-interference cache conflict misses. These techniques depend heavily on the actual layout of the arrays in memory. Array padding, an effective data layout optimization technique, is therefore incorporated by many algorithms to help loop tiling stabilize its effectiveness by avoiding “pathological” array sizes.In this paper, we examine several such combined algorithms in terms of cost-benefit trade-offs, and introduce a new algorithm. The preliminary experimental results show that more precise and costly tile selection and array padding algorithms may not be justified by the resulting performance improvements since such improvements may also be achieved by much simpler and therefore less expensive strategies. The key issues in finding a good tiling algorithm are (1) to identify critical performance factors and (2) to develop corresponding performance models that allow predictions at a sufficient level of accuracy. Following this insight, we have developed a new tiling algorithm that performs better than previous algorithms in terms of execution time and stability, and generates code with a performance comparable to the best measured algorithm. Experimental results on two standard benchmark kernels for matrix multiply and LU factorization show that the new algorithm is orders of magnitude faster than the best previous algorithm without sacrificing stability and execution speed of the generated code.