Accelerating sequential programs on commodity multi-core processors

Authors:
Yuanming Zhang;Gang Xiao;Takanobu Baba
Affiliations:
-;-;-
Venue:
Journal of Parallel and Distributed Computing
Year:
2014

Citing 24
Cited 0

Multiscalar processors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Clustered speculative multithreaded processors

ICS '99 Proceedings of the 13th international conference on Supercomputing
A Chip-Multiprocessor Architecture with Speculative Multithreading

IEEE Transactions on Computers
The Superthreaded Processor Architecture

IEEE Transactions on Computers
Specifying Concurrent Program Modules

ACM Transactions on Programming Languages and Systems (TOPLAS)
Optimizing compilers for modern architectures: a dependence-based approach

Optimizing compilers for modern architectures: a dependence-based approach
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
StreamIt: A Language for Streaming Applications

CC '02 Proceedings of the 11th International Conference on Compiler Construction
Thread Partitioning and Value Prediction for Exploiting Speculative Thread-Level Parallelism

IEEE Transactions on Computers
A General Compiler Framework for Speculative Multithreaded Processors

IEEE Transactions on Parallel and Distributed Systems
Decoupled Software Pipelining with the Synchronization Array

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Helper Threads via Virtual Multithreading

IEEE Micro
Chip Multithreading: Opportunities and Challenges

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Automatically partitioning packet processing applications for pipelined architectures

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Design and Implementation of a Compiler Framework for Helper Threading on Multi-core Processors

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Detecting Conflicts of Interest

RE '06 Proceedings of the 14th IEEE International Requirements Engineering Conference
Support for High-Frequency Streaming in CMPs

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Performance scalability of decoupled software pipelining

ACM Transactions on Architecture and Code Optimization (TACO)
Clustered Decoupled Software Pipelining on Commodity CMP

ICPADS '08 Proceedings of the 2008 14th IEEE International Conference on Parallel and Distributed Systems
Clustered Software Queue for Efficient Pipelined Multithreading

PDCAT '09 Proceedings of the 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies
HELIX: automatic parallelization of irregular programs for chip multiprocessing

Proceedings of the Tenth International Symposium on Code Generation and Optimization

Quantified Score

Hi-index	0.00

Visualization

Abstract

A recently proposed pipelined multithreading (PMT) technique exhibits wide applicability in parallelizing general sequential programs on multi-core processors. However, significant inter-core communication overhead limits PMT performance and prevents its commercial utilization. A simple and effective clustered pipelined multithreading (CPMT) approach is presented to accelerate sequential programs on commodity multi-core processors. This CPMT technique adopts a clustered communication mechanism that can yield very low average communication overhead by eliminating false sharing as well as reducing communication operation and transit delays in the software-only approach. A single-producer/single-consumer concurrent lock-free clusteredQueue algorithm based on a two-level queue structure is also proposed. The accuracy of CPMT is theoretically demonstrated. The performances of the algorithm and CPMT are evaluated on a commodity AMD Phenom four-core processor. The number of enqueue and dequeue times of the algorithm are 20.8 and 23 cycles given an appropriate parameter, respectively. The speedup of CPMT ranges from 13.1% to 119.8% for typical loops extracted from the SPEC CPU 2000 benchmark suite.