The multiscalar architecture
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Decoupling integer execution in superscalar processors
Proceedings of the 28th annual international symposium on Microarchitecture
Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences
Proceedings of the 24th annual international symposium on Computer architecture
Complexity-effective superscalar processors
Proceedings of the 24th annual international symposium on Computer architecture
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Exploiting idle floating-point resources for integer execution
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Effective cluster assignment for modulo scheduling
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Clustered speculative multithreaded processors
ICS '99 Proceedings of the 13th international conference on Supercomputing
Decoupled access/execute computer architectures
ACM Transactions on Computer Systems (TOCS)
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Dynamically managing the communication-parallelism trade-off in future clustered processors
Proceedings of the 30th annual international symposium on Computer architecture
Scalability Aspects of Instruction Distribution Algorithms for Clustered Processors
IEEE Transactions on Parallel and Distributed Systems
Instruction Replication for Reducing Delays Due to Inter-PE Communication Latency
IEEE Transactions on Computers
Hi-index | 0.00 |
Recent works^(1) show that delays introduced in the issue and bypass logic will become critical for wide issue superscalar processors. One of the proposed solutions is clustering the processor core. Clustered architectures benefit from a less complex partitioned processor core and thus, incur in less critical delays. In this paper, we propose a dynamic instruction steering logic for these clustered architectures that decides at decode time the cluster where each instruction is executed. The performance of clustered architectures depends on the inter-cluster communication overhead and the workload balance. We present a scheme that uses runtime information to optimize the trade-off between these figures. The evaluation shows that this scheme can achieve an average speed-up of 35% over a conventional 8-way issue (4 int+4 fp) machine and that it outperforms other previous proposals, either static or dynamic.