Memory coherence in shared virtual memory systems
ACM Transactions on Computer Systems (TOCS)
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Orca: A Language for Parallel Programming of Distributed Systems
IEEE Transactions on Software Engineering
Network-based concurrent computing on the PVM system
Concurrency: Practice and Experience
Compiling Fortran D for MIMD distributed-memory machines
Communications of the ACM
Lazy release consistency for software distributed shared memory
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Integrating message-passing and shared-memory: early experience
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Managing pages in shared virtual memory systems: getting the compiler into the game
ICS '93 Proceedings of the 7th international conference on Supercomputing
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The performance advantages of integrating block data transfer in cache-coherent multiprocessors
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Reducing false sharing on shared memory multiprocessors through compile time data transformations
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Techniques for reducing consistency-related communication in distributed shared-memory systems
ACM Transactions on Computer Systems (TOCS)
Message passing versus distributed shared memory on networks of workstations
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Compiler-directed data prefetching in multiprocessors with memory hierarchies
ICS '90 Proceedings of the 4th international conference on Supercomputing
Memory consistency and event ordering in scalable shared-memory multiprocessors
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
An Implementation of Interprocedural Bounded Regular Section Analysis
IEEE Transactions on Parallel and Distributed Systems
Computing Per-Process Summary Side-Effect Information
Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing
Hiding communication latency and coherence overhead in software DSMs
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Compiler and software distributed shared memory support for irregular applications
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Optimizing communication in HPF programs on fine-grain distributed shared memory
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
VM-based shared memory on low-latency, remote-memory-access networks
Proceedings of the 24th annual international symposium on Computer architecture
Performance evaluation of the Orca shared-object system
ACM Transactions on Computer Systems (TOCS)
Data prefetching for software DSMs
ICS '98 Proceedings of the 12th international conference on Supercomputing
Hardware Support for Flexible Distributed Shared Memory
IEEE Transactions on Computers
A task- and data-parallel programming language based on shared objects
ACM Transactions on Programming Languages and Systems (TOPLAS)
Tapeworm: high-level abstractions of shared accesses
OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
Maps: a compiler-managed memory system for raw machines
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
A high-level abstraction of shared accesses
ACM Transactions on Computer Systems (TOCS)
Adaptive reduction parallelization techniques
Proceedings of the 14th international conference on Supercomputing
Comparative study of page-based and segment-based software DSM through compiler optimization
Proceedings of the 14th international conference on Supercomputing
A synthesis of memory mechanisms for distributed architectures
ICS '01 Proceedings of the 15th international conference on Supercomputing
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Source-level global optimizations for fine-grain distributed shared memory systems
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Reducing coherence overhead of barrier synchronization in software DSMs
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
OpenMP on networks of workstations
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Using high performance GIS software to visualize data: a hands-on software demonstration
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
A proposal for preservice student technology competence
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Eliminating Barrier Synchronization for Compiler-Parallelized Codes on Software DSMs
International Journal of Parallel Programming
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Enhancing Software DSM for Compiler-Parallelized Applications
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Improving Compiler and Run-Time Support for Irregular Reductions Using Local Writes
LCPC '98 Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing
LCR '98 Selected Papers from the 4th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Compilation and Runtime-Optimizations for Software Distributed Shared Memory
LCR '00 Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Efficient Categorization of Sharing Patterns in Software DSM Systems
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
View consistencies and exact implementations
Parallel Computing
Parallelizing Applications into Silicon
FCCM '99 Proceedings of the Seventh Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Evaluation of Compiler-Assisted Software DSM Schemes for a Workstation Cluster
IWIA '99 Proceedings of the 1999 International Workshop on Innovative Architecture
Compile-time Synchronization Optimizations for Software DSMs
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
CAS-DSM: a compiler assisted software distributed shared memory
International Journal of Parallel Programming
Dyn-MPI: Supporting MPI on Non Dedicated Clusters
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Combined compile-time and runtime-driven, pro-active data movement in software DSM systems
LCR '04 Proceedings of the 7th workshop on Workshop on languages, compilers, and run-time support for scalable systems
Shared memory computing on clusters with symmetric multiprocessors and system area networks
ACM Transactions on Computer Systems (TOCS)
Optimizing Compiler for the CELL Processor
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Dyn-MPI: Supporting MPI on medium-scale, non-dedicated clusters
Journal of Parallel and Distributed Computing
A characterization of shared data access patterns in UPC programs
LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Compiling for a hybrid programming model using the LMAD representation
LCPC'01 Proceedings of the 14th international conference on Languages and compilers for parallel computing
Integrating MPI and nanothreads programming model
EUROMICRO-PDP'02 Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing
Optimizing a shared virtual memory system for a heterogeneous CPU-accelerator platform
ACM SIGOPS Operating Systems Review
Probabilistic analysis of time reduction by eliminating barriers in parallel programmes
International Journal of Communication Networks and Distributed Systems
A hybrid approach of OpenMP for clusters
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Barrier elimination based on access dependency analysis for OpenMP
ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications
Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors
Journal of Signal Processing Systems
Hi-index | 0.00 |
On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance of hand-coded message passing by translating data-parallel programs into message passing programs, but efficient execution is limited to those programs for which precise analysis can be carried out. Shared memory is easier to program than message passing and its domain is not constrained by the limitations of parallelizing compilers, but it lags in performance. Our goal is to close that performance gap while retaining the benefits of shared memory. In other words, our goal is (1) to make shared memory as efficient as message passing, whether hand-coded or compiler-generated, (2) to retain its ease of programming, and (3) to retain the broader class of applications it supports.To this end we have designed and implemented an integrated compile-time and run-time software DSM system. The programming model remains identical to the original pure run-time DSM system. No user intervention is required to obtain the benefits of our system. The compiler computes data access patterns for the individual processors. It then performs a source-to-source transformation, inserting in the program calls to inform the run-time system of the computed data access patterns. The run-time system uses this information to aggregate communication, to aggregate data and synchronization into a single message, to eliminate consistency overhead, and to replace global synchronization with point-to-point synchronization wherever possible.We extended the Parascope programming environment to perform the required analysis, and we augmented the TreadMarks run-time DSM library to take advantage of the analysis. We used six Fortran programs to assess the performance benefits: Jacobi, 3D-FFT, Integer Sort, Shallow, Gauss, and Modified Gramm-Schmidt, each with two different data set sizes. The experiments were run on an 8-node IBM SP/2 using user-space communication. Compiler optimization in conjunction with the augmented run-time system achieves substantial execution time improvements in comparison to the base TreadMarks, ranging from 4% to 59% on 8 processors. Relative to message passing implementations of the same applications, the compile-time run-time system is 0-29% slower than message passing, while the base run-time system is 5-212% slower. For the five programs that XHPF could parallelize (all except IS), the execution times achieved by the compiler optimized shared memory programs are within 9% of XHPF.