Microarchitecture support for improving the performance of load target prediction
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Dynamic IPC/clock rate optimization
Proceedings of the 25th annual international symposium on Computer architecture
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Silent Stores and Store Value Locality
IEEE Transactions on Computers
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Power and performance evaluation of globally asynchronous locally synchronous processors
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Reducing set-associative cache energy via way-prediction and selective direct-mapping
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Theoretical Limits on the Data Dependent Performance of Asynchronous Circuits
ASYNC '99 Proceedings of the 5th International Symposium on Advanced Research in Asynchronous Circuits and Systems
Interfacing Synchronous and Asynchronous Modules Within a High-Speed Pipeline
ARVLSI '97 Proceedings of the 17th Conference on Advanced Research in VLSI (ARVLSI '97)
Practical Design and Performance Evaluation of Completion Detection Circuits
ICCD '98 Proceedings of the International Conference on Computer Design
Circuit Implementation of a 600MHz Superscalar RISC Microprocessor
ICCD '98 Proceedings of the International Conference on Computer Design
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Dynamically Trading Frequency for Complexity in a GALS Microprocessor
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
High-performance ULSI: the real limiter to interconnect scaling
Proceedings of the 2005 international workshop on System level interconnect prediction
MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research
IEEE Computer Architecture Letters
Mitigating the Impact of Process Variations on Processor Register Files and Execution Units
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Implications of Device Timing Variability on Full Chip Timing
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Variable latency caches for nanoscale processor
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Characterizing asynchronous variable latencies through probability distribution functions
Microprocessors & Microsystems
Dynamic capacity-speed tradeoffs in SMT processor caches
HiPEAC'07 Proceedings of the 2nd international conference on High performance embedded architectures and compilers
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Hi-index | 0.02 |
Variability is one of the important issues in deep-submicron tecnologies, and the assumption of non-variable, constant latencies in the modules of deep-submicron processors can jeopardize their performance. Cache memories have demonstrated their data-dependent latency due to factors like the coupling capacitances or the distance between the port and the required data. In this paper we present, on one hand, a scheme to detect read operation completion on a variable latency cache memory. On the other hand, we present an asynchronous approach to improve processor performance using this feature. Hence, we propose a Locally-Asynchronous Globally-Synchronous (LAGS) superscalar microarchitecture in which read operations on a variable latency L1 data cache are managed through an asynchronous wrapper. In addition, we demonstrate its feasibility running SPEC2000 benchmarks on a 64-bit superscalar processor modeled through an architectural simulator. Simulations show speedups ranging up to 1.44 and averaging 1.22 over a non-variable cache design.