In this paper we explore a new clustering approach for reducing the complexity of wide-issue in-order processors based on EPIC architectures. Complexity effectiveness is achieved by heavily clustering the pipeline from the decode stage to the commit stage, without the need for any direct bypass between clusters. This is made possible by assuming support for executing compiler-constructed traces. One trace is executed at a time by executing its coarse-grained dependency chains (DCs) in different in-order clusters. Since the DCs of a trace are mutually data-independent, they can be executed in different clusters without any direct communication between them. To execute DCs in narrower clusters without compromising ILP, we propose a compiler algorithm that splits large DCs by duplicating instructions. Through cycle-accurate simulations we show that a DC processor with one 3-wide, one 2-wide and one 1-wide in-order pipeline could achieve performance equivalent to that of a 6-wide in-order superscalar processor. Since a clustered DC microarchitecture is complexity-efficient, it is amenable to higher clock frequencies and will also be easier to design and validate than a 6-wide monolithic design.
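The idea of mutually data-independent DCs, and of splitting a large DC by duplicating an instruction, can be illustrated with a minimal Python sketch. It is not the compiler algorithm proposed in the paper, whose details the abstract does not give. The instruction encoding `(name, dest, srcs)`, the function names `dependency_chains` and `duplicate_producer`, and the `_dupN` naming are all hypothetical. The sketch assumes that every register is written at most once within a trace, so only read-after-write dependences matter, and it only duplicates a producer whose own sources are live-in to the trace.

```python
from collections import defaultdict


def dependency_chains(trace):
    """Group a trace into mutually data-independent chains.

    trace: list of (name, dest, srcs) tuples in program order.
    Assumption: each register is written at most once per trace, so only
    read-after-write dependences link instructions.
    Returns a list of chains, each a list of instruction names.
    """
    parent = list(range(len(trace)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the earliest instruction as the representative.
            parent[max(ra, rb)] = min(ra, rb)

    last_writer = {}
    for i, (_, dest, srcs) in enumerate(trace):
        for s in srcs:
            if s in last_writer:  # source produced inside the trace
                union(i, last_writer[s])
        if dest is not None:
            last_writer[dest] = i

    groups = defaultdict(list)
    for i, (name, _, _) in enumerate(trace):
        groups[find(i)].append(name)
    return [groups[root] for root in sorted(groups)]


def duplicate_producer(trace, index):
    """Give every consumer after the first a private copy of trace[index].

    Simplification: this only separates the consumers if the duplicated
    instruction reads live-in values (nothing produced inside the trace).
    """
    name, dest, srcs = trace[index]
    consumers = [i for i, (_, _, s) in enumerate(trace)
                 if i > index and dest in s]
    out = list(trace)
    copies = []
    for n, c in enumerate(consumers[1:], start=1):
        new_dest = f"{dest}_dup{n}"
        copies.append((f"{name}_dup{n}", new_dest, srcs))
        cname, cdest, csrcs = out[c]
        out[c] = (cname, cdest,
                  tuple(new_dest if s == dest else s for s in csrcs))
    # Place the copies directly after the original producer.
    return out[:index + 1] + copies + out[index + 1:]


# Hypothetical five-instruction trace: i0 feeds two otherwise separate chains.
trace = [
    ("i0", "r1", ()),
    ("i1", "r2", ("r1",)),
    ("i2", "r3", ("r2",)),
    ("i3", "r4", ("r1",)),
    ("i4", "r5", ("r4",)),
]
print(dependency_chains(trace))
# [['i0', 'i1', 'i2', 'i3', 'i4']]
print(dependency_chains(duplicate_producer(trace, 0)))
# [['i0', 'i1', 'i2'], ['i0_dup1', 'i3', 'i4']]
```

In this toy trace, the shared producer `i0` forces all five instructions into a single chain. Duplicating it yields two three-instruction chains that could run in separate clusters with no communication between them, at the cost of one extra executed instruction.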