Modern microprocessor designs continue to obtain impressive performance gains through increasing clock rates and advances in the parallelism obtained via micro-architecture design. Unfortunately, corresponding improvements in memory design technology have not been realized, resulting in latencies of over 100 cycles between processors and main memory. This ever-increasing gap in speed has pushed the current memory-hierarchy approach to its limit.

Traditional approaches to memory-hierarchy management have not yielded satisfactory results. Hardware solutions require more power and energy than desired and do not scale well. Compiler solutions tend to miss too many optimization opportunities because of limited compile-time knowledge of run-time behavior. This paper explores a different approach that combines both approaches by making use of the static knowledge obtained by the compiler in the dynamic decision making of the micro-architecture. We propose a memory-hierarchy design based on working sets that uses compile-time annotations regarding the working set of memory operations to guide cache placement decisions.

Our experiments show that a working-set-based memory hierarchy can significantly reduce the miss rate for memory-intensive tiled kernels by limiting cross interference. The working-set-based memory hierarchy allows the compiler to tile many loops without concern for cross interference in the cache, making tile size choice easier. In addition, the compiler can more easily tailor tile choices to the separate needs of different working sets.
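
To make the notion of cross interference concrete, the following toy simulation compares a conventional direct-mapped cache with one whose sets are split between two working sets. It is only an illustrative sketch, not the design evaluated in the paper: the cache geometry, the `partition` mapping, the array base addresses, and the function names are all assumptions chosen so that two tiles collide in the shared cache.

```python
# Toy model (assumed, not the paper's design): a direct-mapped cache in which
# each access carries a working-set id, standing in for a compile-time annotation.

def count_misses(trace, num_sets, line_size, partition=None):
    """Return the number of misses for a trace of (working_set_id, byte_address).

    partition: optional dict mapping working_set_id -> (first_set, set_count).
    When given, each working set may only occupy its own range of cache sets.
    """
    tags = [None] * num_sets          # block number currently held by each set
    misses = 0
    for ws, addr in trace:
        block = addr // line_size
        if partition is None:
            index = block % num_sets  # conventional placement
        else:
            first, count = partition[ws]
            index = first + block % count  # placement restricted to the partition
        if tags[index] != block:
            misses += 1
            tags[index] = block
    return misses


NUM_SETS, LINE_SIZE, ELEM_SIZE = 8, 64, 8
CACHE_BYTES = NUM_SETS * LINE_SIZE    # 512 bytes

# Two 32-element tiles whose base addresses differ by exactly the cache size,
# so they map to the same sets in a conventional direct-mapped cache.
BASE_A, BASE_B = 0, CACHE_BYTES
trace = []
for _ in range(10):                   # reuse the tiles 10 times
    for i in range(32):
        trace.append((0, BASE_A + i * ELEM_SIZE))  # working set 0: tile of A
        trace.append((1, BASE_B + i * ELEM_SIZE))  # working set 1: tile of B

shared_misses = count_misses(trace, NUM_SETS, LINE_SIZE)
partitioned_misses = count_misses(
    trace, NUM_SETS, LINE_SIZE, partition={0: (0, 4), 1: (4, 4)}
)
print(shared_misses, partitioned_misses)  # prints "640 8"
```

In this contrived trace the two tiles evict each other on every access in the shared cache, whereas the partitioned cache only takes the eight cold misses. Real kernels would fall somewhere in between, and how a hardware design would actually assign cache space to working sets is left open here.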