Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Improving the accuracy of dynamic branch prediction using branch correlation
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
A comparison of dynamic branch predictors that use two levels of branch history
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Branch classification: a new mechanism for improving branch predictor performance
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Improving the accuracy of static branch prediction using branch correlation
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Supporting dynamic data structures on distributed-memory machines
ACM Transactions on Programming Languages and Systems (TOPLAS)
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Compiler-based prefetching for recursive data structures
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Predictability of load/store instruction latencies
MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Exploiting hardware performance counters with flow and context sensitive profiling
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
ProfileMe: hardware support for instruction-level profiling on out-of-order processors
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Informing memory operations: memory performance feedback mechanisms and their applications
ACM Transactions on Computer Systems (TOCS)
The MIPS R10000 Superscalar Microprocessor
IEEE Micro
Optimizing the cache performance of non-numeric applications
Optimizing the cache performance of non-numeric applications
Hi-index | 14.98 |
Latency-tolerance techniques offer the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. However, to fully exploit the benefit of these techniques, one must be careful to apply them only to the dynamic references that are likely to suffer cache misses驴otherwise the runtime overheads can potentially offset any gains. In this paper, we focus on isolating dynamic miss instances in nonnumeric applications, which is a difficult but important problem. Although compilers cannot statically analyze data locality in nonnumeric applications, one viable approach is to use profiling information to measure the actual miss behavior. Unfortunately, the state-of-the-art in cache miss profiling (which we call summary profiling) is inadequate for references with intermediate miss ratios驴it either misses opportunities to hide latency, or else inserts overhead that is unnecessary. To overcome this problem, we propose and evaluate a new profiling technique that helps predict which dynamic instances of a static memory reference will hit or miss in the cache: correlation profiling. Our experimental results demonstrate that roughly half of the 21 nonnumeric applications we study can potentially enjoy significant reductions in memory stall time by exploiting at least one of the three forms of correlation profiling we consider: control-flow correlation, self correlation, and global correlation. In addition, our detailed case studies illustrate that self correlation succeeds because a given reference's cache outcomes often contain repeated patterns and control-flow correlation succeeds because cache outcomes are often call-chain dependent. Finally, we suggest a number of ways to exploit correlation profiling in practice and demonstrate that software prefetching can achieve better performance on a modern superscalar processor when directed by correlation profiling rather than summary profiling information.