Partitioned register files for VLIWs: a preliminary analysis of tradeoffs
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The multiflow trace scheduling compiler
The Journal of Supercomputing - Special issue on instruction-level parallelism
Iterative modulo scheduling: an algorithm for software pipelining loops
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Region-based compilation: an introduction and motivation
Proceedings of the 28th annual international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Effective cluster assignment for modulo scheduling
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
High-quality operation binding for clustered VLIW datapaths
Proceedings of the 38th annual Design Automation Conference
Affinity-based cluster assignment for unrolled loops
ICS '02 Proceedings of the 16th international conference on Supercomputing
Slack: maximizing performance under technological constraints
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Graph-partitioning based instruction scheduling for clustered processors
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors
IEEE Transactions on Parallel and Distributed Systems
Exploiting Pseudo-Schedules to Guide Data Dependence Graph Partitioning
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Very Long Instruction Word architectures and the ELI-512
ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
A New Heuristic for Scheduling Parallel Programs on Multiprocessor
PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Instruction Scheduling for Clustered VLIW DSPs
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
CARS: A New Code Generation Framework for Clustered ILP Processors
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Increasing the number of effective registers in a low-power processor using a windowed register file
Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
FLASH: Foresighted Latency-Aware Scheduling Heuristic for Processors with Customized Datapaths
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Partitioning Variables across Register Windows to Reduce Spill Code in a Low-Power Processor
IEEE Transactions on Computers
A Distributed Control Path Architecture for VLIW Processors
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Allocating Non-Real-Time and Soft Real-Time Jobs in Multiclusters
IEEE Transactions on Parallel and Distributed Systems
Compiler-directed Data Partitioning for Multicluster Processors
Proceedings of the International Symposium on Code Generation and Optimization
Compiler-assisted leakage energy optimization for clustered VLIW architectures
EMSOFT '06 Proceedings of the 6th ACM & IEEE International conference on Embedded software
Inter-cluster communication in VLIW architectures
ACM Transactions on Architecture and Code Optimization (TACO)
Virtual Cluster Scheduling Through the Scheduling Graph
Proceedings of the International Symposium on Code Generation and Optimization
Code and data partitioning for fine-grain parallelism
Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Interactive presentation: Time-constrained clustering for DSE of clustered VLIW-ASP
Proceedings of the conference on Design, automation and test in Europe
Modulo scheduling for highly customized datapaths to increase hardware reusability
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Compiling for vector-thread architectures
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Placement-and-routing-based register allocation for coarse-grained reconfigurable arrays
Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
REDEFINE: Runtime reconfigurable polymorphic ASIC
ACM Transactions on Embedded Computing Systems (TECS)
Compiler-assisted instruction decoder energy optimization for clustered VLIW architectures
HiPC'07 Proceedings of the 14th international conference on High performance computing
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Compiler-assisted power optimization for clustered VLIW architectures
Parallel Computing
Exploring energy-performance trade-offs for heterogeneous interconnect clustered VLIW processors
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
SIMD defragmenter: efficient ILP realization on data-parallel architectures
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Compiler-assisted energy optimization for clustered VLIW processors
Journal of Parallel and Distributed Computing
Compiler supports for VLIW DSP processors with SIMD intrinsics
Concurrency and Computation: Practice & Experience
Low cost control flow protection using abstract control signatures
Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
ACM Transactions on Embedded Computing Systems (TECS)
Hi-index | 0.00 |
Clustered architectures are a solution to the bottleneck of centralized register files in superscalar and VLIW processors. The main challenge associated with clustered architectures is compiler support to effectively partition operations across the available resources on each cluster. In this work, we present a novel technique for clustering operations based on graph partitioning methods. Our approach incorporates new methods of assigning weights to nodes and edges within the dataflow graph to guide the partitioner. Nodes are assigned weights to reflect their resource usage within a cluster, while a slack distribution method intelligently assigns weights to edges to reflect the cost of inserting moves across clusters. A multilevel graph partitioning algorithm, which globally divides a dataflow graph into multiple parts in a hierarchical manner, uses these weights to efficiently generate estimates for the quality of partitions. We found that our algorithm was able to achieve an average of 20% improvement in DSP kernels and 5% improvement in SPECint2000 for a four-cluster architecture.