Software pipelining: an effective scheduling technique for VLIW machines
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
A unified vector/scalar floating-point architecture
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
High-bandwidth data memory systems for superscalar processors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Data cache performance of supercomputer applications
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
An architecture for software-controlled data prefetching
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
An effective on-chip preloading scheme to reduce data access penalty
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Register allocation for software pipelined loops
PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
ICS '92 Proceedings of the 6th international conference on Supercomputing
Pseudo vector processor based on register-windowed superscalar pipeline
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
ACM Computing Surveys (CSUR)
A Fortran compiler for the FPS-164 scientific computer
SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
Lockup-free instruction fetch/prefetch cache organization
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Performance of the IPSC/860 Node Architecture
Performance of the IPSC/860 Node Architecture
CP-PACS: a massively parallel processor for large scale scientific calculations
ICS '97 Proceedings of the 11th international conference on Supercomputing
Heterogeneous multi-computer system: a new platform for multi-paradigm scientific simulation
ICS '02 Proceedings of the 16th international conference on Supercomputing
Design and implementation of FMPL, a fast message-passing library for remote memory operations
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Architecture and Performance of the Hitachi SR2201 Massively Parallel Processor System
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Performance evaluation of CP-PACS on CG benchmark
HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Performance Improvement for Matrix Calculation on CP-PACS Node Processor
HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
The Architecture of Massively Parallel Processor CP-PACS
PAS '97 Proceedings of the 2nd AIZU International Symposium on Parallel Algorithms / Architecture Synthesis
Hi-index | 0.00 |
In this paper, we present a new scalar architecture for high-speed vector processing. Without using cache memory, the proposed architecture tolerates main memory access latency by introducing slide-windowed floating-point registers with data preloading feature and pipelined memory. The architecture can hold upward compatibility with existing scalar architectures. In the new architecture, software can control the window structure. This is the advantage compared with our previous work of register-windows. Because of this advantage, registers are utilized more flexibly and computational efficiency is largely enhanced. Furthermore, this flexibility helps the compiler to generate efficient object codes easily.We have evaluated its performance on Livermore Fortran Kernels. The evaluation results show that the proposed architecture reduces the penalty of main memory access better than an ordinary scalar processor and a processor with cache prefetching. The proposed architecture with 64 registers tolerates memory access latency of 30 CPU cyles. Compared with our previous work, the proposed architecture hides longer memory access latency with fewer registers.