Algorithms that access memory in regular patterns are typical of scientific computing, image processing, and multimedia. Cache conflicts are often responsible for performance degradation, but they can be avoided by an adequate placement of data in memory. The huge search space of such compile-time placements is systematically reduced until we arrive at a class of very simple mappings, well known from data distribution across processors in parallel computing. The choice of parameters is then guided by a cost function that reflects the trade-off between additional instruction overhead and reduced miss penalty. We show by experiment that, when the overhead is kept low, a considerable speedup can be achieved.
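The kind of cross interference the abstract refers to can be made concrete with a small example. The C sketch below is our own illustration, not the paper's mapping or cost function; the cache size, array length, and pad are assumed values. On a direct-mapped cache, two arrays whose base addresses differ by a multiple of the cache size map their corresponding elements to the same cache sets and evict each other on every iteration; shifting one base address by a small pad is a simple placement that removes the conflict.

    /*
     * Illustrative sketch, not the paper's algorithm. With the assumed
     * parameters, N * sizeof(double) = 512 KiB is a multiple of the
     * 32 KiB direct-mapped cache, so a[i] and b_conflict[i] map to the
     * same cache set and thrash. A pad of 16 doubles (128 bytes) shifts
     * b to different sets and eliminates the cross interference.
     */
    #include <stdio.h>

    #define N          (1 << 16)   /* elements per array                 */
    #define CACHE_SIZE (1 << 15)   /* assumed 32 KiB direct-mapped cache */
    #define PAD        16          /* assumed pad, in doubles (128 B)    */

    static double pool[2 * N + PAD];

    int main(void)
    {
        double *a = pool;
        /* Conflicting placement: base addresses N doubles apart,
         * a multiple of CACHE_SIZE.                                    */
        double *b_conflict = pool + N;
        /* Conflict-free placement: pad moves b to different sets.      */
        double *b_padded = pool + N + PAD;

        double *b = b_padded;   /* use b_conflict to observe thrashing  */
        (void)b_conflict;

        for (long i = 0; i < N; i++) {
            a[i] = (double)i;
            b[i] = 1.0;
        }

        double sum = 0.0;
        for (long i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("sum = %f\n", sum);
        return 0;
    }

In this reading, the cost function described above would govern the choice of PAD: the offset must be large enough to separate the arrays' cache sets, yet small enough that the extra address arithmetic and wasted memory remain negligible compared with the avoided miss penalty.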