Support for High-Frequency Streaming in CMPs

Authors:
Ram Rangan;Neil Vachharajani;Adam Stoler;Guilherme Ottoni;David I. August;George Z. N. Cai
Affiliations:
Princeton University;Princeton University;Princeton University;Princeton University;Princeton University;Intel Corporation, Hillsboro, OR
Venue:
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2006

Citing 21
Cited 9

A cache-based message passing scheme for a shared-bus multiprocessor

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
The Stanford Dash Multiprocessor

Computer
Integrating message-passing and shared-memory: early experience

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Integration of message passing and shared memory in the Stanford FLASH multiprocessor

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Memory latency reduction via data prefetching and data forwarding in shared memory multiprocessors

Memory latency reduction via data prefetching and data forwarding in shared memory multiprocessors
Architectural mechanisms for explicit communication in shared memory multiprocessors

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
iWarp: anatomy of a parallel computing system

iWarp: anatomy of a parallel computing system
A stream compiler for communication-exposed architectures

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs

IEEE Micro
StreamIt: A Language for Streaming Applications

CC '02 Proceedings of the 11th International Conference on Compiler Construction
Microarchitectural exploration with Liberty

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Inferential queueing and speculative push for reducing critical communication latencies

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors

HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Communication mechanisms in shared memory multiprocessors

Communication mechanisms in shared memory multiprocessors
Decoupled Software Pipelining with the Synchronization Array

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Scalar Operand Networks

IEEE Transactions on Parallel and Distributed Systems
Automatically partitioning packet processing applications for pipelined architectures

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling

Proceedings of the 32nd annual international symposium on Computer Architecture
Automatic Thread Extraction with Decoupled Software Pipelining

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
A Hybrid Shared Memory/Message Passing Parallel Machine

ICPP '93 Proceedings of the 1993 International Conference on Parallel Processing - Volume 01

Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection

Proceedings of the International Symposium on Code Generation and Optimization
Communication optimizations for global multi-threaded instruction scheduling

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Performance scalability of decoupled software pipelining

ACM Transactions on Architecture and Code Optimization (TACO)
Throughput-driven synthesis of embedded software for pipelined execution on multicore architectures

ACM Transactions on Embedded Computing Systems (TECS)
On-chip communication and synchronization mechanisms with cache-integrated network interfaces

Proceedings of the 7th ACM international conference on Computing frontiers
ReMAP: A Reconfigurable Heterogeneous Multicore Architecture

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Scalable Speculative Parallelization on Commodity Clusters

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
An automatic thread decomposition approach for pipelined multithreading

International Journal of High Performance Computing and Networking
Accelerating sequential programs on commodity multi-core processors

Journal of Parallel and Distributed Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

As the industry moves toward larger-scale chip multiprocessors, the need to parallelize applications grows. High inter-thread communication delays, exacerbated by over-stressed high-latency memory subsystems and ever-increasing wire delays, require parallelization techniques to create partially or fully independent threads to improve performance. Unfortunately, developers and compilers alike often fail to find sufficient independent work of this kind. Recently proposed pipelined streaming techniques have shown significant promise for both manual and automatic parallelization. These techniques have wide-scale applicability because they embrace inter-thread dependences (albeit acyclic dependences) and tolerate long-latency communication of these dependences. This paper addresses the lack of architectural support for this type of concurrency, which has blocked its adoption and hindered related language and compiler research. We observe that both manual and automatic techniques create high-frequency streaming threads, with communication occurring every 5 to 20 instructions. Even while easily tolerating inter-thread transit delays, high-frequency communication makes thread performance very sensitive to intrathread delays from the repeated execution of the communication operations. Using this observation, we define the design-space and evaluate several mechanisms to find a better trade-off between performance and operating system, hardware, and design costs. From this, we find a light-weight streaming-aware enhancement to conventional memory subsystems that doubles the speed of these codes and is within 2% of the best-performing, but heavy-weight, hardware solution.