VLSI assist for a multiprocessor
ASPLOS II Proceedings of the second international conference on Architectual support for programming languages and operating systems
Efficient synchronization primitives for large-scale cache-coherent multiprocessors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Algorithms for scalable synchronization on shared-memory multiprocessors
ACM Transactions on Computer Systems (TOCS)
Fast barrier synchronization hardware
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
The network architecture of the Connection Machine CM-5 (extended abstract)
SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
An effective synchronization network for hot-spot accesses
ACM Transactions on Computer Systems (TOCS)
Experience with fine-grain synchronization in MIMD machines for preconditioned conjugate gradient
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Complexity/performance tradeoffs with non-blocking loads
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Distributed Hardwired Barrier Synchronization for Scalable Multiprocessor Clusters
IEEE Transactions on Parallel and Distributed Systems
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Synchronization and communication in the T3E multiprocessor
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor
Proceedings of the 25th annual international symposium on Computer architecture
System-on-a-chip processor synchronization support in hardware
Proceedings of the conference on Design, automation and test in Europe
Parallel Computer Architecture: A Hardware/Software Approach
Parallel Computer Architecture: A Hardware/Software Approach
The NYU Ultracomputer—designing a MIMD, shared-memory parallel machine (Extended Abstract)
ISCA '82 Proceedings of the 9th annual symposium on Computer Architecture
Supporting Fine-Grained Synchronization on a Simultaneous Multithreading Processor
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Fast Synchronization on Scalable Cache-Coherent Multiprocessors using Hybrid Primitives
IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Fast Barriers for Scalable ccNUMA Systems
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
High-Performance Throughput Computing
IEEE Micro
Packaging the Blue Gene/L supercomputer
IBM Journal of Research and Development
Design and implementation of message-passing services for the Blue Gene/L supercomputer
IBM Journal of Research and Development
Proceedings of the 34th annual international symposium on Computer architecture
Lightweight barrier-based parallelization support for non-cache-coherent MPSoC platforms
CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Rigel: an architecture and scalable programming interface for a 1000-core accelerator
Proceedings of the 36th annual international symposium on Computer architecture
ECMon: exposing cache events for monitoring
Proceedings of the 36th annual international symposium on Computer architecture
Efficient and scalable barrier synchronization for many-core CMPs
Proceedings of the 7th ACM international conference on Computing frontiers
ReMAP: A Reconfigurable Heterogeneous Multicore Architecture
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Low-cost and energy-efficient distributed synchronization for embedded multiprocessors
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Parallelization libraries: Characterizing and reducing overheads
ACM Transactions on Architecture and Code Optimization (TACO)
TLSync: support for multiple fast barriers using on-chip transmission lines
Proceedings of the 38th annual international symposium on Computer architecture
Hardware support for OpenMP collective operations
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Speculative optimizations for parallel programs on multicores
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Low-Overhead, high-speed multi-core barrier synchronization
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Single thread program parallelism with dataflow abstracting thread
ICA3PP'10 Proceedings of the 10th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II
Hi-index | 0.00 |
We examine the ability of CMPs, due to their lower onchip communication latencies, to exploit data parallelism at inner-loop granularities similar to that commonly targeted by vector machines. Parallelizing code in this manner leads to a high frequency of barriers, and we explore the impact of different barrier mechanisms upon the efficiency of this approach. To further exploit the potential of CMPs for fine-grained data parallel tasks, we present barrier filters, a mechanism for fast barrier synchronization on chip multi-processors to enable vector computations to be efficiently distributed across the cores of a CMP. We ensure that all threads arriving at a barrier require an unavailable cache line to proceed, and, by placing additional hardware in the shared portions of the memory subsytem, we starve their requests until they all have arrived. Specifically, our approach uses invalidation requests to both make cache lines unavailable and identify when a thread has reached the barrier. We examine two types of barrier filters, one synchronizing through instruction cache lines, and the other through data cache lines.