Decoupled Software Pipelining with the Synchronization Array

Authors:
Ram Rangan;Neil Vachharajani;Manish Vachharajani;David I. August
Affiliations:
Princeton University;Princeton University;Princeton University;Princeton University
Venue:
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Year:
2004

Citing 23
Cited 33

Computer Technology and Architecture: An Evolving Interaction

Computer
Beyond induction variables: detecting and classifying sequences using a demand-driven SSA form

ACM Transactions on Programming Languages and Systems (TOPLAS)
Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Compiler-based prefetching for recursive data structures

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Streamlining inter-operation memory communication via data dependence prediction

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Dependence based prefetching for linked data structures

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Effective jump-pointer prefetching for linked data structures

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
A scalable approach to thread-level speculation

Proceedings of the 27th annual international symposium on Computer architecture
Architectural support for scalable speculative parallelization in shared-memory multiprocessors

Proceedings of the 27th annual international symposium on Computer architecture
Execution-based prediction using speculative slices

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Speculative precomputation: long-range prefetching of delinquent loads

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Dynamic speculative precomputation

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
A Single-Chip Multiprocessor

Computer
Hierarchical Scheduling Windows

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Microarchitectural exploration with Liberty

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Decoupled access/execute computer architectures

ISCA '82 Proceedings of the 9th annual symposium on Computer Architecture
Supporting Fine-Grained Synchronization on a Simultaneous Multithreading Processor

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Register Renaming and Scheduling for Dynamic Execution of Predicated Code

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Memory Latency-Tolerance Approaches for Itanium Processors: Out-of-Order Execution vs.Speculative Precomputation

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Improving Value Communication for Thread-Level Speculation

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Beating in-order stalls with "flea-flicker" two-pass pipelining

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture

Coordinated, distributed, formal energy management of chip multiprocessors

ISLPED '05 Proceedings of the 2005 international symposium on Low power electronics and design
Automatic Thread Extraction with Decoupled Software Pipelining

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Chip multi-processor scalability for single-threaded applications

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Hardware-modulated parallelism in chip multiprocessors

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Support for High-Frequency Streaming in CMPs

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Code and data partitioning for fine-grain parallelism

Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
SPRINT: a tool to generate concurrent transaction-level models from sequential code

EURASIP Journal on Applied Signal Processing
FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Communication optimizations for global multi-threaded instruction scheduling

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Parallel-stage decoupled software pipelining

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Performance scalability of decoupled software pipelining

ACM Transactions on Architecture and Code Optimization (TACO)
Visualizing potential parallelism in sequential programs

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Input-driven dynamic execution prediction of streaming applications

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
NB-FEB: A Universal Scalable Easy-to-Use Synchronization Primitive for Manycore Architectures

OPODIS '09 Proceedings of the 13th International Conference on Principles of Distributed Systems
Flexible architectural support for fine-grain scheduling

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Concurrent separation logic for pipelined parallelization

SAS'10 Proceedings of the 17th international conference on Static analysis
ReMAP: A Reconfigurable Heterogeneous Multicore Architecture

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
SpiceC: scalable parallelism via implicit copying and explicit commit

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Analysis of execution efficiency in the microthreaded processor UTLEON3

ARCS'11 Proceedings of the 24th international conference on Architecture of computing systems
Extending synchronization constructs in openMP to exploit pipeline parallelism on heterogeneous multi-core

ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part II
Paragon: collaborative speculative loop execution on GPU and CPU

Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
The polyhedral model is more widely applicable than you think

CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Single thread program parallelism with dataflow abstracting thread

ICA3PP'10 Proceedings of the 10th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II
Characteristics of workloads using the pipeline programming model

ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Yada: Straightforward parallel programming

Parallel Computing
Automatic generation of software pipelines for heterogeneous parallel systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
On-the-fly pipeline parallelism

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Fast condensation of the program dependence graph

Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
Load-balanced pipeline parallelism

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
An automatic thread decomposition approach for pipelined multithreading

International Journal of High Performance Computing and Networking
ASC: automatically scalable computation

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Integrating profile-driven parallelism detection and machine-learning-based mapping

ACM Transactions on Architecture and Code Optimization (TACO)
Accelerating sequential programs on commodity multi-core processors

Journal of Parallel and Distributed Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Despite the success of instruction-level parallelism (ILP) optimizations in increasing the performance of microprocessors, certain codes remain elusive. In particular, codes containing recursive data structure (RDS) traversal loops have been largely immune to ILP optimizations, due to the fundamental serialization and variable latency of the loop-carried dependence through a pointer-chasing load. To address these and other situations, we introduce decoupled software pipelining (DSWP), a technique that statically splits a single-threaded sequential loop into multiple non-speculative threads, each of which performs useful computation essential for overall program correctness. The resulting threads execute on thread-parallel architectures such as simultaneous multithreaded (SMT) cores or chip multiprocessors (CMP), expose additional instruction level parallelism, and tolerate latency better than the original single-threaded RDS loop. To reduce overhead, these threads communicate using a synchronization array, a dedicated hardware structure for pipelined inter-thread communication. DSWP used in conjunction with the synchronization array achieves an 11% to 76% speedup in the optimized functions on both statically and dynamically scheduled processors.