The Verification of the On-Chip COMA Cache Coherence Protocol
AMAST 2008 Proceedings of the 12th international conference on Algebraic Methodology and Software Technology
Elyze: enabling safe parallelism in event-driven servers
Proceedings of the 8th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering
Formalizing SANE virtual processor in thread algebra
ICFEM'07 Proceedings of the formal engineering methods 9th international conference on Formal methods and software engineering
Strategies for compiling µTC to novel chip Multiprocessors
SAMOS'07 Proceedings of the 7th international conference on Embedded computer systems: architectures, modeling, and simulation
µTC: an intermediate language for programming chip multiprocessors
ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
Heterogeneous integration to simplify many-core architecture simulations
Proceedings of the 2012 Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools
Unbiased branches: an open problem
ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Apple-CORE: Harnessing general-purpose many-cores with hardware concurrency management
Microprocessors & Microsystems
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Hi-index | 0.00 |
Most microprocessor chips today use an out-of-order instruction execution mechanism. This mechanism allows superscalar processors to extract reasonably high levels of instruction level parallelism (ILP). The most significant problem with this approach is a large instruction window and the logic to support instruction issue from it. This includes generating wake-up signals to waiting instructions and a selection mechanism for issuing them. Wide-issue width also requires a large multi-ported register file, so that each instruction can read and write its operands simultaneously. Neither structure scales well with issue width leading to poor performance relative to the gates used. Furthermore, to obtain this ILP, the execution of instructions must proceed speculatively. An alternative, which avoids this complexity in instruction issue and eliminates speculative execution, is the microthreaded model. This model fragments sequential code at compile time and executes the fragments out of order while maintaining in-order execution within the fragments. The only constraints on the execution of fragments are the dependencies between them, which are managed in a distributed and scalable manner using synchronizing registers. The fragments of code are called microthreads and they capture ILP and loop concurrency. Fragments can be interleaved on a single processor to give tolerance to latency in operands or distributed to many processors to achieve speedup. The implementation of this model is fully scalable. It supports distributed instruction issue and a fully scalable register file, which implements a distributed, shared-register model of communication and synchronization between multiple processors on a single chip. This paper introduces the model, compares it with current approaches and presents an analysis of some of the implementation issues. It also presents results showing scalable performance with issue width over several orders of magnitude, from the same binary code.