Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures

Authors:
Weirong Zhu; Vugranam C. Sreedhar; Ziang Hu; Guang R. Gao
Affiliations:
University of Delaware, Newark, DE; IBM TJ Watson Research Center, Hawthorne, NY; University of Delaware, Newark, DE; University of Delaware, Newark, DE
Venue:
Proceedings of the 34th annual international symposium on Computer architecture
Year:
2007



Abstract

Efficient fine-grain synchronization is essential to harnessing the computational power of many-core architectures. However, designing and implementing fine-grain synchronization in such architectures presents several challenges, including synchronization-induced overhead, storage cost, scalability, and the level of granularity at which synchronization is applicable. This paper proposes the Synchronization State Buffer (SSB), a scalable architectural design for fine-grain synchronization that efficiently performs synchronization between concurrent threads. The design of SSB is motivated by the following observation: at any instant during parallel execution, only a small fraction of memory locations are actively participating in synchronization. Based on this observation, we present a fine-grain synchronization design that records and manages the states of frequently synchronized data using modest hardware support. We have implemented the SSB design in the context of the 160-core IBM Cyclops-64 architecture. Using detailed simulation, we present our experience with a set of benchmarks with different workload characteristics.
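The observation behind SSB can be illustrated with a toy software model. The sketch below is not the hardware design from the paper; it only assumes a small, fixed-capacity table that holds state for the addresses currently involved in synchronization and reports when it is full so that a caller could fall back to some other mechanism. The names `SyncStateBuffer`, `Result`, `acquire`, and `release` are hypothetical and chosen for illustration.

```python
from enum import Enum


class Result(Enum):
    SUCCESS = "success"  # state recorded in the buffer
    BUSY = "busy"        # address already has an active entry
    FULL = "full"        # no free entry; caller must fall back (assumed policy)


class SyncStateBuffer:
    """Toy model: a small table that tracks only actively synchronized addresses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # address -> id of the thread holding the lock

    def acquire(self, addr, thread_id):
        # An existing entry means some thread already holds this address.
        if addr in self.entries:
            return Result.BUSY
        # The buffer is deliberately small; report overflow instead of growing.
        if len(self.entries) >= self.capacity:
            return Result.FULL
        self.entries[addr] = thread_id
        return Result.SUCCESS

    def release(self, addr, thread_id):
        # Only the holder may release; releasing frees the entry for reuse.
        if self.entries.get(addr) != thread_id:
            return False
        del self.entries[addr]
        return True
```

Because an entry exists only while an address is being synchronized, the table size in this model scales with the number of concurrently active synchronizations rather than with the size of memory, which is the property the abstract's observation relies on.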