This paper discusses our experience with fine-grain synchronization for a variant of the preconditioned conjugate gradient method. This algorithm represents a large class of algorithms that have been widely used but have traditionally been difficult to implement efficiently on vector and parallel machines. Through a series of experiments conducted using a simulator of a distributed shared-memory multiprocessor, this paper addresses two major questions related to fine-grain synchronization in the context of this application. First, what is the overall impact of fine-grain synchronization on performance? Second, what are the individual contributions of the following three mechanisms typically provided to support fine-grain synchronization: language-level support, full-empty bits for compact storage and communication of synchronization state, and efficient processor operations on the state bits?

Our experiments indicate that fine-grain synchronization improves overall performance by a factor of 3.7 on 16 processors using the largest problem size we could simulate; we project that a significant performance advantage will be sustained for larger problem sizes. We also show that the bulk of the performance advantage for this application can be attributed to exposing increased parallelism through language-level expression of fine-grain synchronization. A smaller fraction relies on a compact implementation of synchronization state, while an even smaller fraction results from efficient full-empty bit operations.
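To make the idea of full-empty synchronization concrete, the following is a minimal software sketch, not the implementation evaluated in the paper. It emulates a full/empty bit with a condition variable and uses it in a forward substitution (a lower-triangular solve, assumed here as a representative kernel of preconditioning). The names `FullEmptyCell` and `forward_substitute`, and the cyclic row distribution, are hypothetical choices made for illustration.

```python
import threading


class FullEmptyCell:
    """One data word paired with a software-emulated full/empty bit (hypothetical name)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False   # the "full/empty bit"; starts empty
        self._value = None

    def write(self, value):
        """Store a value, mark the cell full, and wake any waiting readers."""
        with self._cond:
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read(self):
        """Block until the cell is full, then return its value."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            return self._value


def forward_substitute(L, b, num_threads=4):
    """Solve L x = b for a lower-triangular L given as a list of rows.

    Rows are dealt out cyclically to threads (an assumption of this sketch).
    Each row waits only on the individual x[j] values it needs, rather than
    on a barrier separating whole phases of the computation.
    """
    n = len(b)
    x = [FullEmptyCell() for _ in range(n)]

    def worker(tid):
        for i in range(tid, n, num_threads):
            acc = b[i]
            for j in range(i):
                if L[i][j] != 0.0:
                    # Blocks only if x[j] has not been produced yet.
                    acc -= L[i][j] * x[j].read()
            x[i].write(acc / L[i][i])

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [cell.read() for cell in x]
```

In this sketch a row with few nonzero entries can complete as soon as its specific inputs are available, which is the kind of additional parallelism that element-level synchronization is meant to expose. Hardware full-empty bits would replace the lock and condition variable with a single state bit per word.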