Architecture and implementation of a VLIW supercomputer
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Partitioned register files for VLIWs: a preliminary analysis of tradeoffs
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Space-time scheduling of instruction-level parallelism on a raw machine
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Modulo scheduling for a fully-distributed clustered VLIW architecture
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
An interleaved cache clustered VLIW processor
ICS '02 Proceedings of the 16th international conference on Supercomputing
Graph-partitioning based instruction scheduling for clustered processors
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Region-based hierarchical operation partitioning for multicluster processors
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Very Long Instruction Word architectures and the ELI-512
ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
CARS: A New Code Generation Framework for Clustered ILP Processors
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Bulldog: a compiler for vliw architectures (parallel computing, reduced-instruction-set, trace scheduling, scientific)
Cluster assignment of global values for clustered VLIW processors
Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Matching on-chip data storage to telecommunication and media application properties
Matching on-chip data storage to telecommunication and media application properties
Variable-Based Multi-module Data Caches for Clustered VLIW Processors
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Optimizing shared cache behavior of chip multiprocessors
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Compiler-assisted data distribution for chip multiprocessors
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
ACM Transactions on Embedded Computing Systems (TECS)
Hi-index | 0.00 |
Multicluster architectures overcome the scaling problem of centralized resources by distributing the datapath, register file, and memory subsystem across multiple clusters connected by a communication network. Traditional compiler partitioning algorithms focus solely on distributing operations across the clusters to maximize instruction-level parallelism. The distribution of data objects is generally ignored. In this work, we examine explicit partitioning of data objects and its affects on operation partitioning. The partitioning of data objects must consider several factors: object size, access frequency/pattern, and dependence patterns between operations that manipulate the objects. This work proposes a compiler-directed approach to synergistically partition both data objects and computation across multiple clusters. First, a global view of the application determines the interaction between data memory objects and their associated computation. Next, data objects are partitioned across multiple clusters with knowledge of the associated computation required by the application. Finally, the resulting distribution of the data objects is relayed to a region-level computation partitioner, which carefully places computation operations in a performance- centric manner.