We compare the performance of software-supported shared memory on a general-purpose network to hardware-supported shared memory on a dedicated interconnect.

For up to eight processors, our results are based on the execution of a set of application programs on an SGI 4D/480 multiprocessor and on TreadMarks, a distributed shared memory system that runs on a Fore ATM LAN of DECstation-5000/240s. Since the DECstation and the 4D/480 use the same processor, primary cache, and compiler, the shared-memory implementation is the principal difference between the systems. Our results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases. For applications that require a large amount of memory bandwidth, TreadMarks can perform better than the SGI 4D/480.

Beyond eight processors, our results are based on execution-driven simulation. Specifically, we compare three implementations: a software implementation on a general-purpose network of uniprocessor nodes; a hardware implementation using a directory-based protocol on a dedicated interconnect; and a combined implementation that uses software to provide shared memory between multiprocessor nodes and hardware to implement shared memory within a node. For the modest problem sizes that we can simulate, the hardware implementation scales well and the software implementation scales poorly. The combined approach delivers performance close to that of the hardware implementation for applications with small to moderate synchronization rates and good locality. Reductions in communication overhead improve the performance of the software and the combined approaches, but synchronization remains a bottleneck.