Balanced scheduling: instruction scheduling when memory latency is uncertain
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Predicting data cache misses in non-numeric applications through correlation profiling
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
On the scheduling of variable latency functional units
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Reducing cross-coupling among interconnect wires in deep-submicron datapath design
Proceedings of the 36th annual ACM/IEEE Design Automation Conference
Clock rate versus IPC: the end of the road for conventional microarchitectures
Proceedings of the 27th annual international symposium on Computer architecture
A case for dynamic pipeline scaling
CASES '02 Proceedings of the 2002 international conference on Compilers, architecture, and synthesis for embedded systems
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
The Alpha 21264 Microprocessor
IEEE Micro
Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
T3: Trends and Challenges in VLSI Technology Scaling towards 100nm
ASP-DAC '02 Proceedings of the 2002 Asia and South Pacific Design Automation Conference
Pipeline stage unification: a low-energy consumption technique for future mobile processors
Proceedings of the 2003 international symposium on Low power electronics and design
Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Improved clock-gating through transparent pipelining
Proceedings of the 2004 international symposium on Low power electronics and design
Exploring High Bandwidth Pipelined Cache Architecture for Scaled Technology
DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Microarchitecture and Design Challenges for Gigascale Integration
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Integrated analysis of power and performance for pipelined microprocessors
IEEE Transactions on Computers
Simulating a LAGS processor to consider variable latency on L1 D-Cache
Proceedings of the 2010 Summer Computer Simulation Conference
Asymmetric-access aware optimization for STT-RAM caches with process variations
Proceedings of the 23rd ACM international conference on Great lakes symposium on VLSI
Hi-index | 0.00 |
Variability is one of the important issues in nanoscale processors. Due to increasing importance of interconnect structures in submicron technologies, the physical location and phenomena such as coupling have an increasing impact on the latency of operations. Therefore, traditional view of rigid access latencies to components wil result in suboptimal architectures. In this paper, we devise a cache architecture with variable access latency. Particularly, we a) develop a non-uniform access level 1 data-cache, b) study the impact of coupling and physical location on level 1 data cache access latencies, and c) develop and study an architecture where the variable latency cache can be accessed while the rest of the pipeline remains synchronous. To find the access latency with different input address transitions and environmental conditions, we first build a SPICE model at a 45nm technology for a cache similar to that of the level 1 data cache of the Intel Prescott architecture. Motivated by the large difference between the worst and best case latencies and the shape of the distribution curve, we change the cache architecture to allow variable latency accesses. Since the latency of the cache is not known at the time of instruction scheduling, we also modify the functional units with the addition of special queues that will temporarily store the dependent instructions and allow the data to be forwarded from the cache to the functional units correctly. Simulations based on SPEC2000 benchmarks show that our variable access latency cache structure can reduce the execution time by as much as 19.4% and 10.7% on average compared to a conventional cache architecture.