SHIFT: shared history instruction fetch for lean-core server processors

Authors:
Cansu Kaynak;Boris Grot;Babak Falsafi
Affiliations:
EcoCloud, EPFL;University of Edinburgh;EcoCloud, EPFL
Venue:
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2013

Abstract

In server workloads, large instruction working sets cause high L1 instruction cache miss rates. Fast access requirements preclude instruction caches large enough to accommodate the deep software stacks prevalent in server applications. Prefetching is a promising approach to mitigating instruction-fetch stalls: it relies on the recurring instruction streams of server workloads to predict future instruction misses. By recording and replaying instruction streams from dedicated storage next to each core, stream-based prefetchers have been shown to overcome instruction-fetch stalls. However, existing stream-based prefetchers incur high history storage costs because of the large instruction working sets and complex control flow inherent in server workloads. These storage requirements prohibit their use in emerging lean-core server processors. We introduce Shared History Instruction Fetch (SHIFT), an instruction prefetcher suitable for lean-core server processors. By sharing the history across cores, SHIFT minimizes the cost per core without sacrificing miss coverage. Moreover, by embedding the shared instruction history in the last-level cache (LLC), SHIFT obviates the need for dedicated instruction history storage, while transparently enabling multiple instruction histories in the presence of workload consolidation. In a 16-core server CMP, SHIFT eliminates 81% (up to 93%) of instruction cache misses, achieving an average speedup of 19% (up to 42%). SHIFT captures 90% of the performance benefit of the state-of-the-art instruction prefetcher at 14x lower storage cost.
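The record-and-replay idea with a history shared across cores can be illustrated with a toy software model. The sketch below is not the hardware design from the paper: the class names (SharedHistory, Core), the choice of a single recording core, the LRU cache model, and the parameters capacity, cache_blocks, and lookahead are all assumptions made for illustration. It only shows how one recorded stream of instruction block addresses might let a second core running the same code avoid most of its misses.

```python
from collections import OrderedDict


class SharedHistory:
    """Hypothetical shared history: a circular log of block addresses plus an index."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.log = [None] * capacity
        self.write_pos = 0  # total number of records written so far
        self.index = {}     # block address -> most recent absolute log position

    def record(self, block):
        self.log[self.write_pos % self.capacity] = block
        self.index[block] = self.write_pos
        self.write_pos += 1

    def read_after(self, block, lookahead):
        """Return up to `lookahead` blocks that followed `block` in the log."""
        pos = self.index.get(block)
        # Ignore unknown blocks and positions already overwritten in the ring.
        if pos is None or pos < self.write_pos - self.capacity:
            return []
        end = min(pos + 1 + lookahead, self.write_pos)
        return [self.log[p % self.capacity] for p in range(pos + 1, end)]


class Core:
    """Toy core with a small LRU instruction cache (assumed model)."""

    def __init__(self, history, cache_blocks=8, lookahead=4, is_recorder=False):
        self.history = history
        self.cache_blocks = cache_blocks
        self.lookahead = lookahead
        self.is_recorder = is_recorder
        self.cache = OrderedDict()
        self.misses = 0

    def _install(self, block):
        self.cache[block] = True
        self.cache.move_to_end(block)
        while len(self.cache) > self.cache_blocks:
            self.cache.popitem(last=False)  # evict least recently used

    def fetch(self, block):
        if block in self.cache:
            self.cache.move_to_end(block)
        else:
            self.misses += 1
            self._install(block)
            # On a miss, replay the shared history that followed this block.
            for nxt in self.history.read_after(block, self.lookahead):
                self._install(nxt)
        if self.is_recorder:
            self.history.record(block)  # assumption: one core records its access stream


def run(core, stream):
    for block in stream:
        core.fetch(block)
    return core.misses


stream = list(range(100, 120))             # 20 distinct instruction blocks
shared = SharedHistory(capacity=64)
recorder = Core(shared, is_recorder=True)  # populates the shared history
follower = Core(shared)                    # different core, same code
baseline = Core(SharedHistory(capacity=64))  # no recorded history to replay

run(recorder, stream)
print(run(follower, stream), run(baseline, stream))  # prints "4 20"
```

In this toy run the follower misses only on every fifth block, because each miss pulls in the next four blocks from the history recorded by another core. A real design would also keep streaming on hits to prefetched blocks and would have to handle divergent control flow, both of which this sketch leaves out.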