High-performance and low-power VLIW cores for numerical computations

Authors:
Miquel Pericas;Eduard Ayguade;Javier Zalamea;Josep Llosa;Mateo Valero
Affiliations:
Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya, Jordi Girona, 1-3. Modul D6 Campus Nord, 08034 Barcelona, Spain (all authors).
Venue:
International Journal of High Performance Computing and Networking
Year:
2004

Abstract

Issue logic is among the worst-scaling structures in a modern microprocessor: increasing the issue width increases the processor area exponentially, and bigger processors have inherently larger wire delays. In this scenario, technology scaling yields smaller performance improvements, because the wire delays do not decrease; instead, they start to dominate the clock cycle. To offer higher performance, the wire problem needs to be tackled. This paper discusses two methods that attempt to move the wire problem off the critical path. The first is clustering, which addresses the wire problem directly by combining several smaller execution cores in the processor back end to perform the computations; each core has a smaller issue width and a much smaller area. The second is widening, which consists of reducing the issue width of the processor while giving the instructions SIMD capabilities. The degree of SIMD parallelism here is small (normally two to four) and does not resemble multimedia or vector extensions. Wide processors use wide functional units that compute the same operation on multiple words. The rationale is that reducing the issue width (but not the computational bandwidth) also reduces the issue logic circuitry and the complexity of structures such as the register file and the cache memory. Compared with a centralised core with 128 registers, 8 FPUs and 4 memory ports, our approach, using an equivalent number of hardware units, achieves speedups of up to 1.7.
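
As a rough illustration of the trade-off described in the abstract, the following minimal sketch compares a few core organisations that keep the floating-point computational bandwidth constant while shrinking the issue width seen by each cluster. Only the centralised baseline of 8 FPUs comes from the abstract; the other configurations, the `CoreConfig` class, and the `wide_op` helper are hypothetical names and values chosen for illustration, not the configurations evaluated in the paper.

```python
from dataclasses import dataclass
from operator import add


@dataclass(frozen=True)
class CoreConfig:
    """Hypothetical description of a VLIW back end (illustrative only)."""
    name: str
    clusters: int          # number of execution clusters
    fpus_per_cluster: int  # FP issue slots handled by each cluster
    widening: int          # words processed by one wide FP operation

    @property
    def fp_words_per_cycle(self) -> int:
        # Peak FP results per cycle, the "computational bandwidth".
        return self.clusters * self.fpus_per_cluster * self.widening


def wide_op(op, a, b):
    """Model a wide functional unit: one operation applied to every word."""
    if len(a) != len(b):
        raise ValueError("wide operands must have the same number of words")
    return tuple(op(x, y) for x, y in zip(a, b))


# The first entry mirrors the 8-FPU centralised baseline from the abstract.
# The remaining entries are assumed examples with the same peak bandwidth.
CONFIGS = [
    CoreConfig("centralised", clusters=1, fpus_per_cluster=8, widening=1),
    CoreConfig("clustered", clusters=4, fpus_per_cluster=2, widening=1),
    CoreConfig("widened", clusters=1, fpus_per_cluster=4, widening=2),
    CoreConfig("clustered+widened", clusters=2, fpus_per_cluster=2, widening=2),
]

if __name__ == "__main__":
    for cfg in CONFIGS:
        print(f"{cfg.name}: {cfg.fpus_per_cluster} FP issue slots per cluster, "
              f"{cfg.fp_words_per_cycle} FP words per cycle")
    # One wide add of degree 2 produces two results from a single issue slot.
    print(wide_op(add, (1.0, 2.0), (3.0, 4.0)))
```

Every configuration in the list has the same peak bandwidth, but the clustered and widened variants issue fewer operations per cluster each cycle, which is the property the abstract relies on to shrink the issue logic and the register file and cache complexity. The sketch models peak bandwidth only; it says nothing about achieved speedup, which depends on how well the compiler can schedule code for each organisation.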