Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers

Authors:
Jack Sampson;Ruben Gonzalez;Jean-Francois Collard;Norman P. Jouppi;Mike Schlansker;Brad Calder
Affiliations:
UC San Diego;UPC Barcelona;Hewlett-Packard Laboratories, Palo Alto, California;Hewlett-Packard Laboratories, Palo Alto, California;Hewlett-Packard Laboratories, Palo Alto, California;UC San Diego
Venue:
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2006

Abstract

We examine the ability of chip multiprocessors (CMPs), due to their lower on-chip communication latencies, to exploit data parallelism at inner-loop granularities similar to those commonly targeted by vector machines. Parallelizing code in this manner leads to a high frequency of barriers, and we explore the impact of different barrier mechanisms on the efficiency of this approach. To further exploit the potential of CMPs for fine-grained data-parallel tasks, we present barrier filters, a mechanism for fast barrier synchronization on CMPs that enables vector computations to be efficiently distributed across the cores. We ensure that all threads arriving at a barrier require an unavailable cache line to proceed, and, by placing additional hardware in the shared portions of the memory subsystem, we starve their requests until they all have arrived. Specifically, our approach uses invalidation requests both to make cache lines unavailable and to identify when a thread has reached the barrier. We examine two types of barrier filters, one synchronizing through instruction cache lines, and the other through data cache lines.
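
The abstract describes the barrier filter only at a high level. As a rough illustration of that behavior, the following is a minimal Python sketch of a toy state machine, not the paper's hardware design. The class name `BarrierFilterModel`, the three state names, and the method names are all hypothetical and chosen for this example; the model assumes each thread first invalidates the barrier line (which signals its arrival) and then issues a fill request for the same line, which is starved until every thread has arrived.

```python
class BarrierFilterModel:
    """Toy model (an assumption, not the paper's design) of a barrier filter.

    Each thread is in one of three hypothetical states:
      "running": executing between barriers
      "blocked": has invalidated the barrier line; its fill requests are starved
      "open":    all threads have arrived; its next fill request is serviced
    """

    def __init__(self, num_threads):
        self.state = ["running"] * num_threads
        self.pending = []   # thread ids whose fill requests are being starved
        self.released = []  # thread ids in the order their fills were serviced

    def invalidate(self, tid):
        """Thread `tid` invalidates the barrier line, signalling its arrival."""
        if self.state[tid] != "running":
            raise RuntimeError("thread has already arrived at this barrier")
        self.state[tid] = "blocked"
        if all(s == "blocked" for s in self.state):
            self._open()

    def fill_request(self, tid):
        """Thread `tid` requests the barrier line; True means it was serviced."""
        if self.state[tid] == "open":
            self._service(tid)
            return True
        if self.state[tid] == "blocked":
            if tid not in self.pending:
                self.pending.append(tid)  # starve the request for now
            return False
        raise RuntimeError("fill request without a prior invalidation")

    def _open(self):
        # The last arrival was observed: stop starving and drain pending fills.
        self.state = ["open"] * len(self.state)
        waiting, self.pending = self.pending, []
        for tid in waiting:
            self._service(tid)

    def _service(self, tid):
        self.state[tid] = "running"
        self.released.append(tid)


if __name__ == "__main__":
    f = BarrierFilterModel(3)
    for tid in (0, 1):
        f.invalidate(tid)
        print(tid, "serviced:", f.fill_request(tid))
    f.invalidate(2)  # last arrival opens the barrier
    print(2, "serviced:", f.fill_request(2))
    print("release order:", f.released)
```

In this sketch the first two threads have their requests held, and the third thread's invalidation releases them before its own request is serviced. Because every thread returns to the "running" state once serviced, the same model instance can be reused for the next barrier episode.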