Simultaneous Multithreaded Vector Architecture: Merging ILP and DLP for High Performance

Authors:
Roger Espasa;Mateo Valero
Affiliations:
-;-
Venue:
HIPC '97 Proceedings of the Fourth International Conference on High-Performance Computing
Year:
1997

Cited by (5):

Improving 3D geometry transformations on a simultaneous multithreaded SIMD processor
ICS '01 Proceedings of the 15th international conference on Supercomputing

Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture

Multithreaded Extension to Multicluster VLIW Processors for Embedded Applications
Proceedings of the conference on Design, Automation and Test in Europe - Volume 2

Architecture optimization for multimedia application exploiting data and thread-level parallelism
Journal of Systems Architecture: the EUROMICRO Journal

Improving SMT performance scheduling processes
EUROMICRO-PDP'02 Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing

Abstract

The goal of this paper is to show that instruction-level parallelism (ILP) and data-level parallelism (DLP) can be merged in a single simultaneous vector multithreaded architecture to execute regular vectorizable code at a performance level that cannot be achieved using either paradigm on its own. We will show that the combination of the two techniques yields very high performance at a low cost and a low complexity. We will show that this architecture achieves a sustained performance on numerical regular codes that is 20 times the performance that can be achieved with today's superscalar microprocessors. Moreover, we will show that the architecture can tolerate very large memory latencies, of up to 100 cycles, with a relatively small performance degradation. This high performance is independent of working set size or of locality considerations, since the DLP paradigm allows very efficient exploitation of a high performance flat memory bandwidth.