Compiler managed micro-cache bypassing for high performance EPIC processors

Authors:
Youfeng Wu;Ryan Rakvic;Li-Ling Chen;Chyi-Chang Miao;George Chrysos;Jesse Fang
Affiliations:
Intel Corporation, Santa Clara, CA;Intel Corporation, Santa Clara, CA;Intel Corporation, Santa Clara, CA;Intel Corporation, Shrewsbury, MA;Intel Corporation, Shrewsbury, MA;Intel Corporation, Santa Clara, CA
Venue:
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Year:
2002

Citing 18
Cited 6

Dependence-based program analysis

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Sentinel scheduling: a model for compiler-controlled speculative execution

ACM Transactions on Computer Systems (TOCS)
Dynamic memory disambiguation using the memory conflict buffer

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
A modified approach to data cache management

Proceedings of the 28th annual international symposium on Microarchitecture
Wavefront scheduling: path based data representation and scheduling of subgraphs

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Active Management of Data Caches by Exploiting Reuse Information

IEEE Transactions on Computers
Run-Time Cache Bypassing

IEEE Transactions on Computers
Region-based caching: an energy-delay efficient memory architecture for embedded processors

CASES '00 Proceedings of the 2000 international conference on Compilers, architecture, and synthesis for embedded systems
Focusing processor policies via critical-path prediction

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Locality vs. criticality

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching

PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Slack: maximizing performance under technological constraints

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Cache Profiling and the SPEC Benchmarks: A Case Study

Computer
EPIC: Explicitly Parallel Instruction Computing

Computer
Guest Editor's Introduction: Introducing the Itanium Processors

IEEE Micro
Itanium Processor Microarchitecture

IEEE Micro
The Non-Critical Buffer: Using Load Latency Tolerance to Improve Data Cache Efficiency

ICCD '99 Proceedings of the 1999 IEEE International Conference on Computer Design
Dynamic Prediction of Critical Path Instructions

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture

Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
A Master-Slave Adaptive Load-Distribution Processor Model on PCA

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 3 - Volume 04
Distributed Data Cache Designs for Clustered VLIW Processors

IEEE Transactions on Computers
Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Less reused filter: improving l2 cache performance via filtering less reused lines

Proceedings of the 23rd international conference on Supercomputing
An efficient compiler framework for cache bypassing on GPUs

Proceedings of the International Conference on Computer-Aided Design

Quantified Score

Hi-index	0.00

Visualization

Abstract

Advanced microprocessors have been increasing clock rates, well beyond the Gigahertz boundary. For such high performance microprocessors, a small and fast data micro cache (ucache) is important to overall performance, and proper management of it via load bypassing has a significant performance impact. In this paper, we propose and evaluate a hardware-software collaborative technique to manage ucache bypassing for EPIC processors. The hardware supports the ucache bypassing with a flag in the load instruction format, and the compiler employs static analysis and profiling to identify loads that should bypass the ucache. The collaborative method achieves a significant improvement in performance for the SpecInt2000 benchmarks. On average, about 40%, 30%, 24%, and 22% of load references are identified to bypass 256B, 1K, 4K, and 8K sized ucaches, respectively. This reduces the ucache miss rates by 39%, 32%, 28%, and 26%. The number of pipeline stalls from loads to their uses is reduced by 13%, 9%, 6%, and 5%. Meanwhile, the L1 and L2 cache misses remain largely unchanged. For the 256B ucache, bypassing improves overall performance on average by 5%.