A first glance at Kilo-instruction based multiprocessors
Proceedings of the 1st conference on Computing frontiers
A case for resource-conscious out-of-order processors: towards kilo-instruction in-flight processors
MEDEA '03 Proceedings of the 2003 workshop on MEmory performance: DEaling with Applications , systems and architecture
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Toward kilo-instruction processors
ACM Transactions on Architecture and Code Optimization (TACO)
An analysis of a resource efficient checkpoint architecture
ACM Transactions on Architecture and Code Optimization (TACO)
Evaluating kilo-instruction multiprocessors
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Scalable Load and Store Processing in Latency Tolerant Processors
Proceedings of the 32nd annual international symposium on Computer Architecture
Instruction packing: reducing power and delay of the dynamic scheduling logic
ISLPED '05 Proceedings of the 2005 international symposium on Low power electronics and design
High-Performance Throughput Computing
IEEE Micro
A Family of Mechanisms for Congestion Control in Wormhole Networks
IEEE Transactions on Parallel and Distributed Systems
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Power-Efficient Wakeup Tag Broadcast
ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Incremental Commit Groups for Non-Atomic Trace Processing
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Speculative execution for hiding memory latency
MEDEA '04 Proceedings of the 2004 workshop on MEmory performance: DEaling with Applications , systems and architecture
Kilo-instruction processors, runahead and prefetching
Proceedings of the 3rd conference on Computing frontiers
Tolerating Dependences Between Large Speculative Threads Via Sub-Threads
Proceedings of the 33rd annual international symposium on Computer Architecture
Instruction packing: Toward fast and energy-efficient instruction scheduling
ACM Transactions on Architecture and Code Optimization (TACO)
CAVA: Using checkpoint-assisted value prediction to hide L2 misses
ACM Transactions on Architecture and Code Optimization (TACO)
A simple speculative load control mechanism for energy saving
MEDEA '06 Proceedings of the 2006 workshop on MEmory performance: DEaling with Applications, systems and architectures
Exploiting Operand Availability for Efficient Simultaneous Multithreading
IEEE Transactions on Computers
Scalable Cache Miss Handling for High Memory-Level Parallelism
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Hardware atomicity for reliable software speculation
Proceedings of the 34th annual international symposium on Computer architecture
BulkSC: bulk enforcement of sequential consistency
Proceedings of the 34th annual international symposium on Computer architecture
Transparent control independence (TCI)
Proceedings of the 34th annual international symposium on Computer architecture
Mechanisms for bounding vulnerabilities of processor structures
Proceedings of the 34th annual international symposium on Computer architecture
On reducing energy-consumption by late-inserting instructions into the issue queue
ISLPED '07 Proceedings of the 2007 international symposium on Low power electronics and design
Energy saving through a simple load control mechanism
ACM SIGARCH Computer Architecture News
Hiding the misprediction penalty of a resource-efficient high-performance processor
ACM Transactions on Architecture and Code Optimization (TACO)
International Journal of High Performance Computing and Networking
Communications of the ACM - Web science
Focused prefetching: performance oriented prefetching based on commit stalls
Proceedings of the 22nd annual international conference on Supercomputing
A Two-Level Load/Store Queue Based on Execution Locality
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Streamlining long latency instructions for seamlessly combined out-of-order and in-order execution
Microprocessors & Microsystems
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
On the potential of latency tolerant execution in speculative multithreading
IFMT '08 Proceedings of the 1st international forum on Next-generation multicore/manycore technologies
An energy-efficient checkpointing mechanism for out of order commit processor
Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design
Exploiting execution locality with a decoupled Kilo-instruction processor
ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Non-uniform instruction scheduling
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Instruction recirculation: eliminating counting logic in wakeup-free schedulers
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Idempotent processor architecture
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
PACS'04 Proceedings of the 4th international conference on Power-Aware Computer Systems
Disjoint out-of-order execution processor
ACM Transactions on Architecture and Code Optimization (TACO)
Implicit transactional memory in kilo-instruction multiprocessors
ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
ARCS'13 Proceedings of the 26th international conference on Architecture of Computing Systems
Tuning the continual flow pipeline architecture
Proceedings of the 27th international ACM conference on International conference on supercomputing
Virtual register renaming: energy efficient substrate for continual flow pipelines
Proceedings of the 23rd ACM international conference on Great lakes symposium on VLSI
Tuning the continual flow pipeline architecture with virtual register renaming
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
Modern out-of-order processors tolerate long latency memory operations by supporting a large number of in-flight instructions. This is particularly useful in numerical applications where branch speculation is normally not a problem and where the cache hierarchy is not capable of delivering the data soon enough. In order to support more in-flight instructions, several resources have to be up-sized, such as the Reorder Buffer (ROB), the general purpose instructions queues, the Load/Store queue and the number of physical registers in the processor. However, scaling-up the number of entries in these resources is impractical because of area, cycle time, and power consumption constraints. In this paper we propose to increase the capacity of future processors by augmenting the number of in-flight instructions. Instead of simply up-sizing resources, we push for new and novel microarchitectural structures that achieve the same performance benefits but with a much lower need for resources. Our main contribution is a new checkpointing mechanism that is capable of keeping thousands of in-flight instructions at a practically constant cost. We also propose a queuing mechanism that takes advantage of the differences in waiting time of the instructions in the flow. Using these two mechanisms our processor has a performance degradation of only 10% for SPEC2000fp over a conventional processor requiring more than an order of magnitude additional entries in the ROB and instruction queues, and about a 200% improvement over a current processor with a similar number of entries.