FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue

Authors:
John Giacomoni;Tipp Moseley;Manish Vachharajani
Affiliations:
University of Colorado at Boulder, Boulder, CO, USA;University of Colorado at Boulder, Boulder, CO, USA;University of Colorado at Boulder, Boulder, CO, USA
Venue:
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Year:
2008

Abstract

Low-overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents FastForward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures with weakly to strongly ordered consistency models. Enqueue and dequeue times on a 2.66 GHz Opteron 2218-based system are as low as 28.5 ns, up to 5x faster than the next best solution. FastForward's effectiveness is demonstrated for real applications by applying it to line-rate soft network processing on Gigabit Ethernet with general-purpose commodity hardware.
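
The abstract does not spell out the algorithm. As a rough illustration of what a single-producer/single-consumer lock-free queue can look like, the sketch below assumes a bounded ring buffer in which an empty slot is marked by a sentinel value, so the producer and the consumer each keep a private index and never read the other's. The names `SpscQueue`, `try_enqueue`, `try_dequeue`, and `run_pipeline` are hypothetical. Python is used only to show the control flow: the sketch does not model cache-line layout or the memory-ordering measures a native implementation would need, and it is not claimed to be the paper's implementation.

```python
import threading
import time


class SpscQueue:
    """Bounded ring buffer for exactly one producer and one consumer.

    Assumption for this sketch: a slot holding None is empty, and any other
    value is a pending item. The producer only touches _head and the
    consumer only touches _tail, so neither index is shared.
    """

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._head = 0  # next slot to write (producer-private)
        self._tail = 0  # next slot to read (consumer-private)

    def try_enqueue(self, item):
        """Store item and return True, or return False if the queue is full."""
        if item is None:
            raise ValueError("None is reserved as the empty-slot marker")
        if self._slots[self._head] is not None:
            return False  # slot not yet consumed, so the queue is full
        self._slots[self._head] = item
        self._head = (self._head + 1) % len(self._slots)
        return True

    def try_dequeue(self):
        """Return the oldest item, or None if the queue is empty."""
        item = self._slots[self._tail]
        if item is None:
            return None
        self._slots[self._tail] = None  # hand the slot back to the producer
        self._tail = (self._tail + 1) % len(self._slots)
        return item


def run_pipeline(n_items, capacity=8):
    """Send 1..n_items through a two-stage pipeline; return what arrives."""
    queue = SpscQueue(capacity)
    received = []

    def producer():
        for i in range(1, n_items + 1):
            while not queue.try_enqueue(i):
                time.sleep(0)  # yield until the consumer frees a slot

    def consumer():
        while len(received) < n_items:
            item = queue.try_dequeue()
            if item is None:
                time.sleep(0)  # yield until the producer fills a slot
            else:
                received.append(item)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Calling `run_pipeline(200)` passes 200 integers from one thread to another through an 8-slot buffer. Because each slot doubles as its own full/empty flag in this sketch, a full queue and an empty queue are both detected by inspecting a single slot rather than by comparing two indices.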