A cache-based message passing scheme for a shared-bus multiprocessor
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
The Stanford Dash Multiprocessor
Computer
Integrating message-passing and shared-memory: early experience
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Integration of message passing and shared memory in the Stanford FLASH multiprocessor
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Memory latency reduction via data prefetching and data forwarding in shared memory multiprocessors
Memory latency reduction via data prefetching and data forwarding in shared memory multiprocessors
Architectural mechanisms for explicit communication in shared memory multiprocessors
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
iWarp: anatomy of a parallel computing system
iWarp: anatomy of a parallel computing system
A stream compiler for communication-exposed architectures
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
StreamIt: A Language for Streaming Applications
CC '02 Proceedings of the 11th International Conference on Compiler Construction
Microarchitectural exploration with Liberty
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Inferential queueing and speculative push for reducing critical communication latencies
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Communication mechanisms in shared memory multiprocessors
Communication mechanisms in shared memory multiprocessors
Decoupled Software Pipelining with the Synchronization Array
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
IEEE Transactions on Parallel and Distributed Systems
Automatically partitioning packet processing applications for pipelined architectures
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling
Proceedings of the 32nd annual international symposium on Computer Architecture
Automatic Thread Extraction with Decoupled Software Pipelining
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
A Hybrid Shared Memory/Message Passing Parallel Machine
ICPP '93 Proceedings of the 1993 International Conference on Parallel Processing - Volume 01
Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection
Proceedings of the International Symposium on Code Generation and Optimization
Communication optimizations for global multi-threaded instruction scheduling
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Performance scalability of decoupled software pipelining
ACM Transactions on Architecture and Code Optimization (TACO)
Throughput-driven synthesis of embedded software for pipelined execution on multicore architectures
ACM Transactions on Embedded Computing Systems (TECS)
On-chip communication and synchronization mechanisms with cache-integrated network interfaces
Proceedings of the 7th ACM international conference on Computing frontiers
ReMAP: A Reconfigurable Heterogeneous Multicore Architecture
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Scalable Speculative Parallelization on Commodity Clusters
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
An automatic thread decomposition approach for pipelined multithreading
International Journal of High Performance Computing and Networking
Accelerating sequential programs on commodity multi-core processors
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
As the industry moves toward larger-scale chip multiprocessors, the need to parallelize applications grows. High inter-thread communication delays, exacerbated by over-stressed high-latency memory subsystems and ever-increasing wire delays, require parallelization techniques to create partially or fully independent threads to improve performance. Unfortunately, developers and compilers alike often fail to find sufficient independent work of this kind. Recently proposed pipelined streaming techniques have shown significant promise for both manual and automatic parallelization. These techniques have wide-scale applicability because they embrace inter-thread dependences (albeit acyclic dependences) and tolerate long-latency communication of these dependences. This paper addresses the lack of architectural support for this type of concurrency, which has blocked its adoption and hindered related language and compiler research. We observe that both manual and automatic techniques create high-frequency streaming threads, with communication occurring every 5 to 20 instructions. Even while easily tolerating inter-thread transit delays, high-frequency communication makes thread performance very sensitive to intrathread delays from the repeated execution of the communication operations. Using this observation, we define the design-space and evaluate several mechanisms to find a better trade-off between performance and operating system, hardware, and design costs. From this, we find a light-weight streaming-aware enhancement to conventional memory subsystems that doubles the speed of these codes and is within 2% of the best-performing, but heavy-weight, hardware solution.