Runtime support for integrating precomputation and thread-level parallelism on simultaneous multithreaded processors

Authors:
Tanping Wang;Filip Blagojevic;Dimitrios S. Nikolopoulos
Affiliations:
The College of William and Mary, McGlothlin-Street Hall, Williamsburg VA;The College of William and Mary, McGlothlin-Street Hall, Williamsburg VA;The College of William and Mary, McGlothlin-Street Hall, Williamsburg VA
Venue:
LCR '04 Proceedings of the 7th workshop on Workshop on languages, compilers, and run-time support for scalable systems
Year:
2004

Citing 16
Cited 5

Tolerating latency through software-controlled data prefetching

Tolerating latency through software-controlled data prefetching
Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Limits on the performance benefits of multithreading and prefetching

Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Tolerating latency in multiprocessors through compiler-inserted prefetching

ACM Transactions on Computer Systems (TOCS)
Improving the performance of speculatively parallel applications on the Hydra CMP

ICS '99 Proceedings of the 13th international conference on Supercomputing
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Post-pass binary adaptation for software-based speculative precomputation

PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Dynamic speculative precomputation

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
A general compiler framework for speculative multithreading

Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures
A quantitative framework for automated pre-execution thread selection

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Exploring the Use of Hyper-Threading Technology for Multimedia Applications with Intel® OpenMP* Compiler

IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Speculative Data-Driven Multithreading

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Initial Observations of the Simultaneous Multithreading Pentium 4 Processor

Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
A study of source-level compiler algorithms for automatic construction of pre-execution code

ACM Transactions on Computer Systems (TOCS)
Cluster-enabled OpenMP: An OpenMP compiler for the SCASH software distributed shared memory system

Scientific Programming

Exploring the performance limits of simultaneous multithreading for memory intensive applications

The Journal of Supercomputing
Algorithm, software, and hardware optimizations for Delaunay mesh generation on simultaneous multithreaded architectures

Journal of Parallel and Distributed Computing
Evaluating OpenMP on chip multithreading platforms

IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
Exploring the capacity of a modern SMT architecture to deliver high scientific application performance

HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications
smt-SPRINTS: software precomputation with intelligent streaming for resource-constrained SMTs

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents runtime mechanisms that enable flexible use of speculative precomputation in conjunction with thread-level parallelism on SMT processors. The mechanisms were implemented and evaluated on a real multi-SMT system. So far, speculative precomputation and thread-level parallelism have been used disjunctively on SMT processors and no attempts have been made to compare and possibly combine these techniques for further optimization. We present runtime support mechanisms for coordinating precomputation with its sibling computation, so that precomputation is regulated to avoid cache pollution and sufficient runahead distance is allowed from the targeted computation. We also present a task queue mechanism to orchestrate precomputation and thread-level parallelism, so that they can be used conjunctively in the same program. The mechanisms are motivated by the observation that different parts of a program may benefit from different modes of multithreaded execution. Furthermore, idle periods during TLP execution or sequential sections can be used for precomputation and vice versa. We apply the mechanisms in loop-structured scientific codes. We present experimental results that verify that no single technique (precomputation or TLP) in isolation achieves the best performance in all cases. Efficient combination of precomputation and TLP is most often the best solution.