Circulating shared-registers for multiprocessor systems

Authors:
Donald Johnson;David J. Lilja;John Riedl
Affiliations:
CS Department, Azusa Pacific University, Azusa, CA;Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN;Department of Computer Science, University of Minnesota, Minneapolis, MN
Venue:
Journal of Systems Architecture: the EUROMICRO Journal
Year:
2006

Abstract

Techniques for fine-grained data sharing are generally available only on specialized architectures, usually ones built around a shared bus. The CIRculating Common-Update Sharing (CIRCUS) mechanism provides low-latency, user-level, contention-free access to a set of shared circulating data registers. The local access latency is near zero for both read and write operations. These operations can be mapped onto more complex operations (arithmetic, logical, or data-reduction operations such as minimum or sum) that the circulating register hardware (CRH) performs on the circulating copy of a register. The CRH can also be used to perform atomic operations, such as fetch&add or swap. For a two-dimensional hierarchy of N processing elements (PEs), the write-latency (the time until the circulating register is updated with a new value) and the update-latency (the time until all CRH modules can see the updated value) have an optimum cluster size proportional to $(N \cdot I/D)^{1/2}$, where I is the intercluster time and D is the inter-PE time, including the time between and through one node. The latencies, when optimally clustered, are proportional to $(N \cdot I \cdot D)^{1/2}$. Sub-microsecond write-latency is expected for up to 15,255 PEs or 660 workstations. For higher levels of hierarchy, the expected write-latency is shown to be proportional to the sum of the latencies of all loop hierarchies. CIRCUS is applicable to a wide variety of system architectures and topologies.
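
The square-root scaling in the abstract can be illustrated with a minimal sketch. It assumes a simplified cost model that is not taken from the paper: a value crosses one local loop of c PEs at cost D per hop, and one intercluster loop of N/c clusters at cost I per hop, so the latency is c·D + (N/c)·I. Minimizing this over c gives c = (N·I/D)^(1/2) and a latency of 2·(N·I·D)^(1/2), which matches the stated proportionalities. The function names and the sample values of N, I, and D are hypothetical.

```python
import math


def ring_latency(cluster_size, n_pes, inter_pe, inter_cluster):
    """Latency under the ASSUMED model: c*D for the local loop plus (N/c)*I for the cluster loop."""
    n_clusters = n_pes / cluster_size
    return cluster_size * inter_pe + n_clusters * inter_cluster


def optimal_cluster_size(n_pes, inter_pe, inter_cluster):
    """Closed-form minimizer of the assumed model: sqrt(N * I / D)."""
    return math.sqrt(n_pes * inter_cluster / inter_pe)


def brute_force_best(n_pes, inter_pe, inter_cluster):
    """Try every integer cluster size and return the one with the lowest modelled latency."""
    return min(
        range(1, n_pes + 1),
        key=lambda c: ring_latency(c, n_pes, inter_pe, inter_cluster),
    )


if __name__ == "__main__":
    # Hypothetical values: N = 1024 PEs, D = 1 time unit per hop, I = 4 time units per hop.
    n, d, i = 1024, 1.0, 4.0
    c_star = optimal_cluster_size(n, d, i)
    print("closed-form cluster size:", c_star)
    print("brute-force cluster size:", brute_force_best(n, d, i))
    print("latency at optimum:", ring_latency(c_star, n, d, i))
    print("2 * sqrt(N * I * D):", 2 * math.sqrt(n * i * d))
```

With these sample values the closed form and the exhaustive search agree on a cluster size of 64, and the modelled latency there equals 2·(N·I·D)^(1/2) = 128 time units. The constant factor 2 is a property of this simplified model only; the abstract claims proportionality, not a specific constant.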