We describe the design of a Light Weight Processing migration-NUMA (LWP-mNUMA) architecture, a novel high-performance system design that provides hardware support for a partitioned global address space, migrating subjects, and word-level synchronization primitives. Using the architectural definition, we show how combinations of structures work together to carry out basic actions such as address translation, migration, in-memory synchronization, and work management. We present simulation results for microkernels showing that LWP-mNUMA compensates for latency with far greater memory-access concurrency than is possible on conventional systems. In particular, several microkernels model difficult, irregular access patterns that, in certain problem areas, have limited speedups to dozens of conventional processors. On these, the results show speedup continuing to increase up to 1024 multicore mNUMA processing nodes running over 1 million threadlets.