The performance of page-based software shared virtual memory (SVM) is still far from that achieved on hardware-coherent distributed shared memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity.

This paper shows that by providing simple and general support for asynchronous message handling in a commodity network interface (NI), and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. They also require neither visibility into the node memory system nor code instrumentation to identify memory operations. We prototype the mechanisms and such a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support in a shared Memory Abstraction), on a cluster of SMPs with a programmable NI, though the mechanisms are simple and do not require programmability.

We find that the performance improvements are substantial, bringing performance on a small-scale SMP cluster much closer to that of hardware-coherent shared memory for many applications, and we show the value of each of the mechanisms in different applications. Application performance improves by about 37% on average for reasonably well performing applications, even on our relatively slow programmable NI, and more for others. We discuss the key remaining bottlenecks at the protocol level and use a firmware performance monitor in the NI to understand the interactions with and the implications for the communication layer.