ICPP '96 Proceedings of the 1996 International Conference on Parallel Processing - Volume 3
Techniques for fine-grained data sharing are generally available only on specialized architectures, usually involving a shared bus. The CIRculating Common-Update Sharing (CIRCUS) mechanism provides low-latency, user-level, contention-free access to a set of shared circulating data registers. The local access latency is near zero for both read and write operations. These operations can be mapped onto more complex operations, such as arithmetic, logical, or data reduction operations (for example, minimum or sum), which the circulating register hardware (CRH) performs on the circulating copy of a register. The CRH can also perform atomic operations, such as fetch&add or swap. For a two-dimensional hierarchy of N processing elements (PEs), the write-latency (the time until the circulating register is updated with a new value) and the update-latency (the time until all CRH modules can see the updated value) are minimized at an optimum cluster size proportional to $(N \cdot I / D)^{1/2}$, where I is the intercluster time and D is the inter-PE time (the time between nodes plus the time through one node). The latencies, when optimally clustered, are proportional to $(N \cdot I \cdot D)^{1/2}$. Sub-microsecond write-latency is expected for up to 15,255 PEs or 660 workstations. For higher levels of hierarchy, the expected write-latency is shown to be proportional to the sum of the latencies of all loop hierarchies. CIRCUS is applicable to a wide variety of system architectures and topologies.
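The square-root scaling stated above can be illustrated with a small sketch. The cost model here is an assumption made for illustration, not the paper's exact latency expression: a value is taken to travel once around a local loop of c PEs (cost c · D) and once around a loop of N / c clusters (cost (N / c) · I). Minimizing c · D + (N / c) · I over c gives a cluster size of (N · I / D)^(1/2) and a latency of 2 · (N · I · D)^(1/2), which agrees with the proportionalities in the abstract. The function names and the example values are hypothetical.

```python
import math


def loop_latency(n_pe, cluster_size, inter_cluster, inter_pe):
    """Latency under the ASSUMED two-level model (not taken from the paper):
    one pass around a local loop of `cluster_size` PEs plus one pass around
    a loop of `n_pe / cluster_size` clusters."""
    return cluster_size * inter_pe + (n_pe / cluster_size) * inter_cluster


def optimal_cluster_size(n_pe, inter_cluster, inter_pe):
    """Cluster size minimizing loop_latency: sqrt(N * I / D)."""
    return math.sqrt(n_pe * inter_cluster / inter_pe)


def optimal_latency(n_pe, inter_cluster, inter_pe):
    """Latency at the optimal cluster size: 2 * sqrt(N * I * D)."""
    return 2.0 * math.sqrt(n_pe * inter_cluster * inter_pe)


if __name__ == "__main__":
    # Hypothetical values in arbitrary time units: N = 1024 PEs, I = 4, D = 1.
    n, i, d = 1024, 4.0, 1.0
    c_opt = optimal_cluster_size(n, i, d)
    print("optimal cluster size:", c_opt)
    print("latency at optimum:", optimal_latency(n, i, d))
    # Halving or doubling the cluster size gives a higher latency in this model.
    for c in (c_opt / 2, c_opt, c_opt * 2):
        print(c, loop_latency(n, c, i, d))
```

Under this assumed model, halving or doubling the cluster size away from the optimum raises the latency by the same amount, which reflects the symmetric trade-off between intra-cluster and inter-cluster loop length.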