An automatic thread decomposition approach for pipelined multithreading

Authors:
Yuanming Zhang;Kanemitsu Ootsu;Takashi Yokota;Takanobu Baba
Affiliations:
College of Computer Science and Technology, Zhejiang University of Technology, 18 Chaowang St., Hangzhou 310014, China;Department of Information Science, Utsunomiya University, 7-1-2 Yoto, Utsunomiya 321-8585, Japan;Department of Information Science, Utsunomiya University, 7-1-2 Yoto, Utsunomiya 321-8585, Japan;Department of Information Science, Utsunomiya University, 7-1-2 Yoto, Utsunomiya 321-8585, Japan
Venue:
International Journal of High Performance Computing and Networking
Year:
2013

Abstract

Thread decomposition is critical for pipelined multithreading (PMT) to achieve high performance on multi-core processors. This paper presents an automatic thread decomposition approach that maps the decomposition problem onto a graph-theoretic framework. The goal is to construct an optimised directed acyclic graph (DAG) in which the bottleneck node is as small as possible and node sizes are balanced. In this approach, control dependence is treated as a special form of data dependence, and an effective method is proposed to remove redundant control dependences. A weighted DAG is then constructed by assigning weights to all nodes and all dependences according to profile information. An automatic thread decomposition algorithm generates an optimised pipeline from the weighted DAG. The algorithm was evaluated on a commodity multi-core processor, and experimental results show speedups ranging from 113% to 174% on some SPEC CPU 2000 benchmark programs.
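
The abstract does not give the decomposition algorithm itself, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a weighted DAG whose node weights are integer cost estimates (for example, profiled cycle counts), orders the nodes topologically, and cuts that order into contiguous pipeline stages so that the heaviest stage is as light as possible. The function names, the example graph, and the weights are all hypothetical, and dependence (edge) weights are ignored for brevity.

```python
def topological_order(nodes, edges):
    """Kahn's algorithm. nodes: {name: weight}; edges: [(src, dst), ...]."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    ready = sorted(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.pop(0)
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle; a DAG is required")
    return order


def partition_pipeline(nodes, edges, stages):
    """Cut one topological order into at most `stages` contiguous groups,
    minimising the weight of the heaviest group (the pipeline bottleneck).

    Assumption: weights are positive integers. The result is optimal only
    for the topological order chosen here, not over all possible orders.
    """
    order = topological_order(nodes, edges)
    weights = [nodes[n] for n in order]

    def fits(cap):
        # Greedily pack nodes into groups whose weight never exceeds cap.
        groups, load = 1, 0
        for w in weights:
            if load + w > cap:
                groups += 1
                load = 0
            load += w
        return groups <= stages

    # Binary search for the smallest feasible bottleneck weight.
    lo, hi = max(weights), sum(weights)
    while lo < hi:
        mid = (lo + hi) // 2
        if fits(mid):
            hi = mid
        else:
            lo = mid + 1

    # Rebuild the stage assignment for the chosen bottleneck weight.
    result, current, load = [], [], 0
    for n, w in zip(order, weights):
        if current and load + w > lo:
            result.append(current)
            current, load = [], 0
        current.append(n)
        load += w
    result.append(current)
    return result, lo


# Hypothetical profile weights and dependences.
example_nodes = {"a": 4, "b": 3, "c": 5, "d": 2, "e": 6}
example_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(partition_pipeline(example_nodes, example_edges, 3))
```

Because each stage is a contiguous slice of a topological order, every dependence runs from an earlier stage to the same or a later stage, which keeps the resulting pipeline acyclic. A fuller treatment would also weigh the communication cost of dependences that cross stage boundaries.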