In this paper, we present a unified approach to vector and scalar computation that uses a single register file for both scalar operands and vector elements. The goal of this architecture is to improve scalar performance while broadening the range of vectorizable applications. For example, reduction operations and recurrences can be expressed in vector form in this architecture. For most applications, this approach yields greater overall performance than an approach that emphasizes peak vector performance. The hardware required to support the enhanced vector capability is insignificant, yet it allows the execution of two operations per cycle for vectorized code. Moreover, the unified vector/scalar register file required for peak performance is an order of magnitude smaller than traditional vector register files, allowing efficient on-chip VLSI implementation. Results of simulations of the Livermore Loops and Linpack on this architecture are presented.
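The claim that reductions and recurrences can be expressed in vector form can be illustrated with a small software model. The sketch below is not the paper's instruction set. It assumes a hypothetical `vector_op` that issues its element operations one after another over a single flat register file, with a per-operand register step (1 to advance through consecutive registers, 0 to stay on the same register). The function name, the step parameters, and the 32-entry register file are all assumptions made for illustration.

```python
import operator


def vector_op(regs, op, dst, src_a, src_b, length,
              step_dst=1, step_a=1, step_b=1):
    """Hypothetical vector instruction over one unified register file.

    Element operations are issued strictly in order, so a result written
    by element i is visible to element i + 1. This ordering is what lets
    reductions and recurrences be written as a single vector operation
    in this model.
    """
    for i in range(length):
        a = regs[src_a + i * step_a]
        b = regs[src_b + i * step_b]
        regs[dst + i * step_dst] = op(a, b)
    return regs


def elementwise_example():
    # Ordinary vector add: regs[8..11] = regs[0..3] + regs[4..7].
    regs = [0] * 32
    regs[0:4] = [1, 2, 3, 4]
    regs[4:8] = [10, 20, 30, 40]
    vector_op(regs, operator.add, dst=8, src_a=0, src_b=4, length=4)
    return regs[8:12]


def reduction_example():
    # Sum reduction: the accumulator register (12) does not advance,
    # while the second source walks through regs[0..3].
    regs = [0] * 32
    regs[0:4] = [1, 2, 3, 4]
    vector_op(regs, operator.add, dst=12, src_a=12, src_b=0, length=4,
              step_dst=0, step_a=0)
    return regs[12]


def recurrence_example():
    # First-order recurrence x[i] = x[i-2] + x[i-1]: the destination
    # overlaps the sources, shifted by two registers.
    regs = [0] * 32
    regs[16], regs[17] = 1, 1
    vector_op(regs, operator.add, dst=18, src_a=16, src_b=17, length=6)
    return regs[16:24]
```

In this model, scalar code can read the accumulator or any recurrence term directly from the same register file, with no transfer between separate vector and scalar registers. Whether the actual hardware encodes operand stepping this way is not stated in the abstract.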