Exploiting idle floating-point resources for integer execution

Authors:
S. Subramanya Sastry;Subbarao Palacharla;James E. Smith
Affiliations:
Computer Sciences Dept., University of Wisconsin-Madison;Computer Sciences Dept., University of Wisconsin-Madison;Dept. of ECE, University of Wisconsin-Madison
Venue:
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Year:
1998

Citing 15
Cited 11

Compilers: principles, techniques, and tools

Compilers: principles, techniques, and tools
A VLIW architecture for a trace Scheduling Compiler

IEEE Transactions on Computers - Special issue on architectural support for programming languages and operating systems
MIPS RISC architectures

MIPS RISC architectures
Partitioned register files for VLIWs: a preliminary analysis of tradeoffs

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The multiflow trace scheduling compiler

The Journal of Supercomputing - Special issue on instruction-level parallelism
Characterization of alpha AXP performance using TP and SPEC workloads

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Contrasting characteristics and cache performance of technical and multi-user commercial workloads

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Multiscalar processors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Decoupling integer execution in superscalar processors

Proceedings of the 28th annual international symposium on Microarchitecture
Intel MMX for multimedia PCs

Communications of the ACM
Complexity-effective superscalar processors

Proceedings of the 24th annual international symposium on Computer architecture
Trace processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Very Long Instruction Word architectures and the ELI-512

ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
Instruction scheduling and fetch mechanisms for clustered vliw processors

Instruction scheduling and fetch mechanisms for clustered vliw processors

Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Reducing wire delay penalty through value prediction

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Improving Latency Tolerance of Multithreading through Decoupling

IEEE Transactions on Computers
Dynamic Code Partitioning for Clustered Architectures

International Journal of Parallel Programming
Hierarchical Interconnects for On-Chip Clustering

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
The Effectiveness of Loop Unrolling for Modulo Scheduling in Clustered VLIW Architectures

ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Scalability Aspects of Instruction Distribution Algorithms for Clustered Processors

IEEE Transactions on Parallel and Distributed Systems
Instruction Replication for Reducing Delays Due to Inter-PE Communication Latency

IEEE Transactions on Computers
Exploiting idle register classes for fast spill destination

Proceedings of the 22nd annual international conference on Supercomputing
Resource-Driven optimizations for transient-fault detecting superscalar microarchitectures

ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
Dynamic register promotion of stack variables

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization

Quantified Score

Hi-index	0.01

Visualization

Abstract

In conventional superscalar microarchitectures with partitioned integer and floating-point resources, all floating-point resources are idle during execution of integer programs. Palacharla and Smith [26] addressed this drawback and proposed that the floating-point subsystem be augmented to support integer operations. The hardware changes required are expected to be fairly minimal.To exploit these idle floating resources, the compiler must identify integer code that can be profitably offloaded to the augmented floating-point subsystem. In this paper, we present two compiler algorithms to do this. The basic scheme offloads integer computation to the floating-point subsystem using existing program loads/stores for inter-partition communication. For the SPECINT95 benchmarks, we show that this scheme offloads from 5% to 29% of the total dynamic instructions to the floating-point subsystem. The advanced scheme inserts copy instructions and duplicates some instructions to further offload computation. We evaluate the effectiveness of the two schemes using timing simulation. We show that the advanced scheme can offload from 9% to 41% of the total dynamic instructions to the floating-point subsystem. In doing so, speedups from 3% to 23% are achieved over a conventional microarchitecture.