Cost-Conscious Strategies to Increase Performance of Numerical Programs on Aggressive VLIW Architectures

Authors:
David López;Josep Llosa;Mateo Valero;Eduard Ayguadé
Affiliations:
Technical Univ. of Catalunya, Barcelona, Spain;Technical Univ. of Catalunya, Barcelona, Spain;Technical Univ. of Catalunya, Barcelona, Spain;Technical Univ. of Catalunya, Barcelona, Spain
Venue:
IEEE Transactions on Computers
Year:
2001

Abstract

Loops account for most of the execution time of numerical applications. Their performance is limited either by the resources offered by the architecture or by recurrences in the computation. To execute more operations per cycle, current processors are designed with growing degrees of resource replication (replication technique) for memory ports and functional units. However, the high cost of this technique in area and cycle time precludes the use of high degrees of replication: a longer cycle time may offset any reduction in the number of execution cycles, and a larger area may lead to an unimplementable configuration. An alternative to resource replication is resource widening (widening technique), also used in some recent designs, in which the width of the resources is increased (i.e., a single operation is performed over multiple data). Moreover, several general-purpose superscalar microprocessors have been implemented with multiply-add fused floating-point units (fusion technique), which reduces the latency of the combined operation and the number of resources used. In this paper, we evaluate a broad set of VLIW processor design alternatives that combine the three techniques. We perform a technological projection for the next processor generations in order to foresee which alternatives may be implementable. From this study, we conclude that, if cost is taken into account, combining certain degrees of replication and widening in the hardware resources is more effective than applying replication alone. We also confirm that multiply-add fused units will have a significant impact in raising the performance of future processor architectures with a reasonable increase in cost.
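
The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the paper's evaluation model: the `Loop` and `Config` types, the function names, the assumption that every floating-point operation can be packed into a wide operation, and all numbers (including the relative cycle times) are hypothetical, chosen only to show how cycle count and cycle time combine. Recurrences and register pressure are ignored.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Loop:
    mem_ops: int          # loads + stores per iteration
    compactable_mem: int  # memory ops that can be packed into wide accesses
    fp_ops: int           # FP adds + multiplies per iteration
    fusable_pairs: int    # multiply/add pairs a fused unit can merge


@dataclass(frozen=True)
class Config:
    buses: int         # replication degree of memory ports
    fpus: int          # replication degree of FP units
    width: int         # widening degree (elements per wide operation)
    fused: bool        # FP units support fused multiply-add
    cycle_time: float  # hypothetical relative cycle time


def cycles_per_iteration(loop: Loop, cfg: Config) -> float:
    """Resource-only lower bound on cycles per original iteration."""
    w = cfg.width
    # Schedule w iterations together: a compactable op needs one wide slot,
    # a non-compactable op needs one slot per iteration.
    mem_slots = loop.compactable_mem + (loop.mem_ops - loop.compactable_mem) * w
    # Fusion merges each multiply/add pair into a single operation.
    fp_slots = loop.fp_ops - (loop.fusable_pairs if cfg.fused else 0)
    # Assumption: every FP op is compactable, so fp_slots does not scale with w.
    cycles = max(ceil(mem_slots / cfg.buses), ceil(fp_slots / cfg.fpus))
    return cycles / w


def relative_time(loop: Loop, cfg: Config) -> float:
    """Cycles per iteration weighted by the configuration's cycle time."""
    return cycles_per_iteration(loop, cfg) * cfg.cycle_time


# Hypothetical loop body: two loads, one store, one multiply, one add.
axpy = Loop(mem_ops=3, compactable_mem=3, fp_ops=2, fusable_pairs=1)
replicated = Config(buses=4, fpus=4, width=1, fused=False, cycle_time=1.2)
widened = Config(buses=2, fpus=2, width=2, fused=False, cycle_time=1.0)

for name, cfg in (("replicated", replicated), ("widened", widened)):
    print(name, cycles_per_iteration(axpy, cfg), relative_time(axpy, cfg))
```

Under these made-up numbers both configurations reach the same cycles-per-iteration bound, so the one assumed to have the shorter cycle time finishes sooner. With different loop mixes or cycle-time assumptions the comparison can go the other way.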