Compilers: principles, techniques, and tools
Compilers: principles, techniques, and tools
A VLIW architecture for a trace Scheduling Compiler
IEEE Transactions on Computers - Special issue on architectural support for programming languages and operating systems
MIPS RISC architectures
Partitioned register files for VLIWs: a preliminary analysis of tradeoffs
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The multiflow trace scheduling compiler
The Journal of Supercomputing - Special issue on instruction-level parallelism
Characterization of alpha AXP performance using TP and SPEC workloads
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Contrasting characteristics and cache performance of technical and multi-user commercial workloads
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Decoupling integer execution in superscalar processors
Proceedings of the 28th annual international symposium on Microarchitecture
Communications of the ACM
Complexity-effective superscalar processors
Proceedings of the 24th annual international symposium on Computer architecture
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Very Long Instruction Word architectures and the ELI-512
ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
Instruction scheduling and fetch mechanisms for clustered vliw processors
Instruction scheduling and fetch mechanisms for clustered vliw processors
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Reducing wire delay penalty through value prediction
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Improving Latency Tolerance of Multithreading through Decoupling
IEEE Transactions on Computers
Dynamic Code Partitioning for Clustered Architectures
International Journal of Parallel Programming
Hierarchical Interconnects for On-Chip Clustering
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
The Effectiveness of Loop Unrolling for Modulo Scheduling in Clustered VLIW Architectures
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Scalability Aspects of Instruction Distribution Algorithms for Clustered Processors
IEEE Transactions on Parallel and Distributed Systems
Instruction Replication for Reducing Delays Due to Inter-PE Communication Latency
IEEE Transactions on Computers
Exploiting idle register classes for fast spill destination
Proceedings of the 22nd annual international conference on Supercomputing
Resource-Driven optimizations for transient-fault detecting superscalar microarchitectures
ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
Dynamic register promotion of stack variables
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Hi-index | 0.01 |
In conventional superscalar microarchitectures with partitioned integer and floating-point resources, all floating-point resources are idle during execution of integer programs. Palacharla and Smith [26] addressed this drawback and proposed that the floating-point subsystem be augmented to support integer operations. The hardware changes required are expected to be fairly minimal.To exploit these idle floating resources, the compiler must identify integer code that can be profitably offloaded to the augmented floating-point subsystem. In this paper, we present two compiler algorithms to do this. The basic scheme offloads integer computation to the floating-point subsystem using existing program loads/stores for inter-partition communication. For the SPECINT95 benchmarks, we show that this scheme offloads from 5% to 29% of the total dynamic instructions to the floating-point subsystem. The advanced scheme inserts copy instructions and duplicates some instructions to further offload computation. We evaluate the effectiveness of the two schemes using timing simulation. We show that the advanced scheme can offload from 9% to 41% of the total dynamic instructions to the floating-point subsystem. In doing so, speedups from 3% to 23% are achieved over a conventional microarchitecture.