Handling long-latency loads in a simultaneous multithreading processor

Authors:
Dean M. Tullsen;Jeffery A. Brown
Affiliations:
University of California, San Diego, La Jolla, CA;University of California, San Diego, La Jolla, CA
Venue:
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Year:
2001

Citing 17
Cited 60

Instruction issue logic for high-performance, interruptable pipelined processors

ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
Analysis of multithreaded architectures for parallel computing

SPAA '90 Proceedings of the second annual ACM symposium on Parallel algorithms and architectures
An elementary processor architecture with simultaneous instruction issuing from multiple threads

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Limitations of cache prefetching on a bus-based multiprocessor

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Interleaving: a multithreading technique targeting multiprocessors and workstations

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Increasing superscalar performance through multistreaming

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
The Tera computer system

ICS '90 Proceedings of the 4th international conference on Supercomputing
Symbiotic jobscheduling for a simultaneous multithreaded processor

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Execution-based prediction using speculative slices

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Speculative precomputation: long-range prefetching of delinquent loads

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Sparcle: An Evolutionary Processor Design for Large-Scale Multiprocessors

IEEE Micro
Effective Hardware-Based Data Prefetching for High-Performance Processors

IEEE Transactions on Computers
Supporting Fine-Grained Synchronization on a Simultaneous Multithreading Processor

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture

Dynamic speculative precomputation

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Transparent Threads: Resource Sharing in SMT Processors for High Single-Thread Performance

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Compiling for instruction cache performance on a multithreaded architecture

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Front-End Policies for Improved Issue Efficiency in SMT Processors

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Thread Partitioning and Value Prediction for Exploiting Speculative Thread-Level Parallelism

IEEE Transactions on Computers
Predictable performance in SMT processors

Proceedings of the 1st conference on Computing frontiers
Back-end assignment schemes for clustered multithreaded processors

Proceedings of the 18th annual international conference on Supercomputing
Wire Delay is Not a Problem for SMT (In the Near Future)

Proceedings of the 31st annual international symposium on Computer architecture
Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance

Proceedings of the 31st annual international symposium on Computer architecture
Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor

Proceedings of the 31st annual international symposium on Computer architecture
QoS for High-Performance SMT Processors in Embedded Systems

IEEE Micro
Heat-and-run: leveraging SMT and CMP to manage power density through the operating system

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Dynamically Controlled Resource Allocation in SMT Processors

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Balanced Multithreading: Increasing Throughput via a Low Cost Multithreading Hierarchy

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Reducing the Soft-Error Rate of a High-Performance Microprocessor

IEEE Micro
Scalable cache memory design for large-scale SMT architectures

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Opportunistic Transient-Fault Detection

Proceedings of the 32nd annual international symposium on Computer Architecture
Tornado warning: the perils of selective replay in multithreaded processors

Proceedings of the 19th annual international conference on Supercomputing
An Instruction Fetch Policy Handling L2 Cache Misses in SMT Processors

HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Learning-Based SMT Processor Resource Distribution via Hill-Climbing

Proceedings of the 33rd annual international symposium on Computer Architecture
Adaptive reorder buffers for SMT processors

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Exploiting Operand Availability for Efficient Simultaneous Multithreading

IEEE Transactions on Computers
Fairness and Throughput in Switch on Event Multithreading

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Scalable Cache Miss Handling for High Memory-Level Parallelism

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive dynamic thread scheduling for simultaneous multithreaded architectures with a detector thread

Journal of Parallel and Distributed Computing
An L2-miss-driven early register deallocation for SMT processors

Proceedings of the 21st annual international conference on Supercomputing
Fairness enforcement in switch on event multithreading

ACM Transactions on Architecture and Code Optimization (TACO)
Resource area dilation to reduce power density in throughput servers

ISLPED '07 Proceedings of the 2007 international symposium on Low power electronics and design
Optimal Power/Performance Pipeline Depth for SMT in Scaled Technologies

IEEE Transactions on Computers
Optimising long-latency-load-aware fetch policies for SMT processors

International Journal of High Performance Computing and Networking
The shared-thread multiprocessor

Proceedings of the 22nd annual international conference on Supercomputing
Software-Controlled Priority Characterization of POWER5 Processor

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
An adaptive resource partitioning algorithm for SMT processors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Reducing register pressure in SMT processors through L2-miss-driven early register release

ACM Transactions on Architecture and Code Optimization (TACO)
Hill-climbing SMT processor resource distribution

ACM Transactions on Computer Systems (TOCS)
MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Per-thread cycle accounting in SMT processors

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Memory-level parallelism aware fetch policies for simultaneous multithreading processors

ACM Transactions on Architecture and Code Optimization (TACO)
A swarm-inspired resource distribution for SMT processors

Proceedings of the 3rd International Conference on Bio-Inspired Models of Network, Information and Computing Sytems
The impact of speculative execution on SMT processors

International Journal of Parallel Programming
Service level agreement for multithreaded processors

ACM Transactions on Architecture and Code Optimization (TACO)
Improving SMT performance: an application of genetic algorithms to configure resizable caches

Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers
Enabling software management for multicore caches with a lightweight hardware support

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Probabilistic job symbiosis modeling for SMT processor scheduling

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Compiler techniques for reducing data cache miss rate on a multithreaded architecture

HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
Dynamic warp subdivision for integrated branch and memory divergence tolerance

Proceedings of the 37th annual international symposium on Computer architecture
Efficient runahead threads

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Exploiting unbalanced thread scheduling for energy and performance on a CMP of SMT processors

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Data layout for cache performance on a multithreaded architecture

Transactions on high-performance embedded architectures and compilers III
Managing SMT resource usage through speculative instruction window weighting

ACM Transactions on Architecture and Code Optimization (TACO)
A phase adaptive cache hierarchy for SMT processors

Microprocessors & Microsystems
Enhancing ICOUNT2.8 fetch policy with better fairness for SMT processors

ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
Do trace cache, value prediction and prefetching improve SMT throughput?

ARCS'06 Proceedings of the 19th international conference on Architecture of Computing Systems
A fetch policy maximizing throughput and fairness for two-context SMT processors

APPT'05 Proceedings of the 6th international conference on Advanced Parallel Processing Technologies
Enhancing DCache warn fetch policy for SMT processors

ISPA'05 Proceedings of the Third international conference on Parallel and Distributed Processing and Applications
Improving GPU performance via large warps and two-level warp scheduling

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Probabilistic modeling for job symbiosis scheduling on SMT processors

ACM Transactions on Architecture and Code Optimization (TACO)
Constructing large and fast multi-level cell STT-MRAM based cache for embedded processors

Proceedings of the 49th Annual Design Automation Conference
FROCM: a fair and low-overhead method in SMT processor

HPCC'07 Proceedings of the Third international conference on High Performance Computing and Communications

Quantified Score

Hi-index	0.01

Visualization

Abstract

Simultaneous multithreading architectures have been defined previously with fully shared execution resources. When one thread in such an architecture experiences a very long-latency operation, such as a load miss, the thread will eventually stall, potentially holding resources which other threads could be using to make forward progress.This paper shows that in many cases it is better to free the resources associated with a stalled thread rather than keep that thread ready to immediately begin execution upon return of the loaded data. Several possible architectures are examined, and some simple solutions are shown to be very effective, achieving speedups close to 6.0 in some cases, and averaging 15% speedup with four threads and over 100% speedup with two threads running. Response times are cut in half for several workloads in open system experiments.