This paper studies the impact of chaining and of several instruction scheduling schemes on one-memory-port vector supercomputers, using the Cray-1 and Cray-2 as examples. Because the Cray-2 vector processor lacks instruction chaining, it requires a different instruction scheduling scheme from that of the Cray-1. The paper characterizes the situations in which simple vector scheduling can generate optimal code, which fully utilizes at least one functional unit, for machines with chaining. Given enough registers, polycyclic scheduling guarantees full utilization of one functional unit, after an initial transient, for loops with acyclic dependence graphs, even without chaining. Workloads are represented by the vectorizable Livermore Fortran Kernels (LFKs). The effectiveness of polycyclic scheduling on the Cray-2 is compared with that of optimal simple vector scheduling on the Cray-1. The speedup of polycyclic vector scheduling on the Cray-2 over the schedules produced by the current CFT77 compiler is also presented for several vectorizable LFKs.
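The claim that polycyclic scheduling keeps one functional unit fully busy can be illustrated with a toy resource-bound calculation. The sketch below is not the paper's model: the loop body, the unit names (`mem`, `mul`, `add`), and the helpers `resource_bound` and `utilization` are hypothetical, time is counted in whole vector-instruction slots, and startup latencies and the initial transient are ignored. It assumes an acyclic loop body, so the steady-state interval between successive iterations is set by the busiest unit alone, and it contrasts that with a fully serialized baseline chosen only for comparison, which does not represent the Cray-1, the Cray-2, or CFT77 output.

```python
from collections import Counter

def resource_bound(body):
    """Return (busiest unit, slots per iteration) for a loop body.

    `body` lists the functional unit used by each vector instruction.
    For an acyclic dependence graph, the busiest unit's instruction count
    is the shortest steady-state interval between loop iterations.
    """
    counts = Counter(body)
    unit, slots = max(counts.items(), key=lambda kv: kv[1])
    return unit, slots

def utilization(body, interval):
    """Fraction of `interval` slots during which each unit is busy."""
    counts = Counter(body)
    return {unit: n / interval for unit, n in counts.items()}

# Hypothetical body for y[i] = a * x[i] + y[i] on a one-memory-port machine:
# load x, load y, multiply, add, store y.
body = ["mem", "mem", "mul", "add", "mem"]

unit, interval = resource_bound(body)
overlapped = utilization(body, interval)    # iterations overlapped at the bound
serialized = utilization(body, len(body))   # assumed baseline: no overlap at all

print(unit, interval)        # busiest unit and its slots per iteration
print(overlapped["mem"])     # memory port utilization when overlapped
print(serialized["mem"])     # memory port utilization when serialized
```

In this toy case the single memory port is the busiest unit, which is why a one-memory-port machine is bounded by its memory instructions whenever a loop body has more memory operations than operations on any other unit.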