A recently proposed pipelined multithreading (PMT) technique is widely applicable to parallelizing general sequential programs on multi-core processors. However, significant inter-core communication overhead limits PMT performance and has prevented its commercial adoption. This paper presents clustered pipelined multithreading (CPMT), a simple and effective approach to accelerating sequential programs on commodity multi-core processors. CPMT adopts a clustered communication mechanism that achieves very low average communication overhead in a software-only approach by eliminating false sharing and by reducing communication-operation and transit delays. A single-producer/single-consumer concurrent lock-free clusteredQueue algorithm based on a two-level queue structure is also proposed. The correctness of CPMT is demonstrated theoretically. The performance of the algorithm and of CPMT is evaluated on a commodity AMD Phenom four-core processor. Given an appropriate parameter, the enqueue and dequeue operations of the algorithm take 20.8 and 23 cycles, respectively. CPMT achieves speedups ranging from 13.1% to 119.8% on typical loops extracted from the SPEC CPU 2000 benchmark suite.
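The abstract does not spell out the clusteredQueue algorithm, so the following is only a minimal sketch of what a two-level single-producer/single-consumer queue might look like, not the authors' implementation. It assumes that the first level is a private buffer (a "cluster") filled by the producer and drained by the consumer, and that the second level is a shared ring holding whole clusters. The class name `ClusteredQueue`, the parameters `num_clusters` and `cluster_size`, and the helper `run_pipeline` are all hypothetical. Python cannot show cache-line alignment or memory ordering; a real version in C or C++ would need atomics or fences and padding to avoid false sharing.

```python
import threading
import time


class ClusteredQueue:
    """Sketch of a two-level SPSC queue: a shared ring of whole clusters.

    Assumption: exactly one producer thread and one consumer thread.
    No locks are used; each index has a single writer.
    """

    def __init__(self, num_clusters=8, cluster_size=16):
        self.num_clusters = num_clusters
        self.cluster_size = cluster_size
        self.ring = [None] * num_clusters  # shared level: published clusters
        self.head = 0      # count of clusters consumed (consumer writes)
        self.tail = 0      # count of clusters published (producer writes)
        self._out = []     # producer-private cluster being filled
        self._in = []      # consumer-private cluster being drained
        self._in_pos = 0   # next unread position in self._in

    def enqueue(self, item):
        # Most calls touch only producer-private state.
        self._out.append(item)
        if len(self._out) == self.cluster_size:
            self.flush()

    def flush(self):
        # Publish the current (possibly partial) cluster as one unit.
        if not self._out:
            return
        while self.tail - self.head == self.num_clusters:
            time.sleep(0)  # ring full: yield to the consumer
        self.ring[self.tail % self.num_clusters] = self._out
        self._out = []
        self.tail += 1     # publish only after the slot is written

    def dequeue(self):
        # Most calls touch only consumer-private state.
        if self._in_pos == len(self._in):
            while self.head == self.tail:
                time.sleep(0)  # ring empty: yield to the producer
            slot = self.head % self.num_clusters
            self._in = self.ring[slot]
            self.ring[slot] = None
            self._in_pos = 0
            self.head += 1     # release the slot after taking the cluster
        item = self._in[self._in_pos]
        self._in_pos += 1
        return item


def run_pipeline(n):
    """Two-stage pipeline: stage 1 squares i, stage 2 adds 1."""
    q = ClusteredQueue()
    results = []

    def stage1():
        for i in range(n):
            q.enqueue(i * i)
        q.flush()  # push out the final partial cluster

    def stage2():
        for _ in range(n):
            results.append(q.dequeue() + 1)

    threads = [threading.Thread(target=stage1), threading.Thread(target=stage2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In this sketch the shared indices `head` and `tail` are touched once per cluster rather than once per item, which is the kind of amortization a clustered communication mechanism aims for. Note that `dequeue` blocks until a cluster is published, so the producer must call `flush` after its last item.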