Exploiting fine-grained parallelism through a combination of hardware and software techniques
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
MISC: a Multiple Instruction Stream Computer
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The multiscalar architecture
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Control flow prediction with tree-like subgraphs for superscalar processors
Proceedings of the 28th annual international symposium on Microarchitecture
Trace cache: a low latency approach to high bandwidth instruction fetching
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences
Proceedings of the 24th annual international symposium on Computer architecture
Exploiting instruction level parallelism in processors by caching scheduled groups
Proceedings of the 24th annual international symposium on Computer architecture
Complexity-effective superscalar processors
Proceedings of the 24th annual international symposium on Computer architecture
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Understanding some simple processor-performance limits
IBM Journal of Research and Development - Special issue: performance analysis and its impact on design
The MIPS R10000 Superscalar Microprocessor
IEEE Micro
Multiscalar Execution along a Single Flow of Control
ICPP '97 Proceedings of the international Conference on Parallel Processing
The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor
COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference
PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
A design space evaluation of grid processor architectures
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Hierarchical Interconnects for On-Chip Clustering
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Efficient Interconnects for Clustered Microarchitectures
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Speculative Clustered Caches for Clustered Processors
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Realizing high IPC through a scalable memory-latency tolerant multipath microarchitecture
ACM SIGARCH Computer Architecture News
Dynamically managing the communication-parallelism trade-off in future clustered processors
Proceedings of the 30th annual international symposium on Computer architecture
Cluster prefetch: tolerating on-chip wire delays in clustered microarchitectures
Proceedings of the 18th annual international conference on Supercomputing
On-Chip Interconnects and Instruction Steering Schemes for Clustered Microarchitectures
IEEE Transactions on Parallel and Distributed Systems
Inherently Workload-Balanced Clustered Microarchitecture
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Scalability Aspects of Instruction Distribution Algorithms for Clustered Processors
IEEE Transactions on Parallel and Distributed Systems
Hi-index | 0.00 |
Recent fascination for dynamic scheduling as a means for exploiting instruction-level parallelism has introduced significant interest in the scalability aspects of dynamic scheduling hardware. In order to overcome the scalability problems of centralized hardware schedulers, many decentralized execution models are being proposed and investigated recently. The crux of all these models is to split the instruction window across multiple processing elements (PEs) that do independent, scheduling of instructions. The decentralized execution models proposed so far can be grouped under 3 categories, based on the criterion used for assigning an instruction to a particular PE. They are: (i) execution unit dependence based decentralization (EDD), (ii) control dependence based decentralization (CDD), and (iii) data dependence based decentralization (DDD). This paper investigates the performance aspects of these three decentralization approaches. Using a suite of important benchmarks and realistic system parameters, we examine performance differences resulting from the type of partitioning as well as from specific implementation issues such as the type of PE interconnect.We found that with a ring-type PE interconnect, the DDD approach performs the best when the number of PEs is moderate, and that the CDD approach performs best when the number of PEs is large. The currently used approach---EDD---does not perform well for any configuration. With a realistic crossbar, performance does not increase with the number of PEs for any of the partitioning approaches. The results give insight into the best way to use the transistor budget available for implementing the instruction window.