The quest for higher performance is pushing designers toward finer-grained multiprocessor machines. Ever-increasing chip densities, driven by CMOS improvements, fuel research into highly parallel chip multiprocessors with hundreds of processing elements. At such levels of parallelism, synchronization is set to become a major performance bottleneck, and efficient support for it an important design criterion. Previous research has shown that integrated support for fine-grained synchronization can yield significant performance benefits over traditional coarse-grained synchronization. However, little progress has been made in supporting fine-grained synchronization transparently to processor nodes, which is perhaps a key reason it has not been widely adopted. In this paper, we propose a novel approach called synchronization coherence that provides transparent fine-grained synchronization and caching in both multiprocessor machines and single-chip multiprocessors. Our approach merges fine-grained synchronization mechanisms with traditional cache coherence protocols. It reduces network utilization and synchronization-related processing overheads while adding minimal hardware complexity relative to cache coherence mechanisms or previously reported fine-grained synchronization techniques. Besides making synchronization transparent to processor nodes, for the applications studied it improves performance by up to 23% and energy efficiency by up to 24% with no L2 caches, compared to previous fine-grained synchronization techniques. The performance improvement rises to as much as 38% when simulating with an ideal L2 cache system.
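As a rough illustration of what fine-grained synchronization means in contrast to a coarse-grained barrier or a single lock, the sketch below models each data word with its own full/empty state, so a consumer waits only on the specific word it needs. This is a software analogy under our own assumptions, not the hardware mechanism proposed in the paper; the names `SyncWord` and `run_pipeline` are hypothetical.

```python
import threading


class SyncWord:
    """One data word with its own full/empty state (hypothetical software model)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write(self, value):
        # Fill the word and wake readers waiting on this word only.
        with self._cond:
            if self._full:
                raise RuntimeError("word is already full")
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read(self):
        # Block until this particular word has been written, then return it.
        with self._cond:
            while not self._full:
                self._cond.wait()
            return self._value


def run_pipeline(n):
    """Producer fills n words; consumer uses each word as soon as it is full."""
    words = [SyncWord() for _ in range(n)]
    results = []

    def producer():
        for i, word in enumerate(words):
            word.write(i * i)

    def consumer():
        # No barrier: the consumer synchronizes per word, not per array.
        for word in words:
            results.append(word.read() + 1)

    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a coarse-grained scheme the consumer would have to wait for the producer to finish the whole array; here it can proceed element by element, which is the kind of overlap that fine-grained synchronization is meant to expose.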