The goal of this paper is to show that instruction-level parallelism (ILP) and data-level parallelism (DLP) can be merged in a single simultaneous vector multithreaded architecture to execute regular vectorizable code at a performance level that cannot be achieved using either paradigm on its own. We will show that the combination of the two techniques yields very high performance at low cost and low complexity. We will show that this architecture achieves a sustained performance on regular numerical codes that is 20 times the performance that can be achieved with today's superscalar microprocessors. Moreover, we will show that the architecture can tolerate very large memory latencies, of up to 100 cycles, with a relatively small performance degradation. This high performance is independent of working set size or of locality considerations, since the DLP paradigm allows very efficient exploitation of a high-performance flat memory bandwidth.