An integrated compile-time/run-time software distributed shared memory system

Authors:
Sandhya Dwarkadas; Alan L. Cox; Willy Zwaenepoel
Affiliations:
Department of Computer Science, Rice University; Department of Computer Science, Rice University; Department of Computer Science, Rice University
Venue:
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Year:
1996

Abstract

On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance of hand-coded message passing by translating data-parallel programs into message-passing programs, but efficient execution is limited to those programs for which precise analysis can be carried out. Shared memory is easier to program than message passing and its domain is not constrained by the limitations of parallelizing compilers, but it lags in performance. Our goal is to close that performance gap while retaining the benefits of shared memory. In other words, our goal is (1) to make shared memory as efficient as message passing, whether hand-coded or compiler-generated, (2) to retain its ease of programming, and (3) to retain the broader class of applications it supports.

To this end we have designed and implemented an integrated compile-time and run-time software DSM system. The programming model remains identical to that of the original pure run-time DSM system, and no user intervention is required to obtain the benefits of our system. The compiler computes data access patterns for the individual processors. It then performs a source-to-source transformation, inserting calls into the program that inform the run-time system of the computed data access patterns. The run-time system uses this information to aggregate communication, to combine data and synchronization into a single message, to eliminate consistency overhead, and to replace global synchronization with point-to-point synchronization wherever possible.

We extended the Parascope programming environment to perform the required analysis, and we augmented the TreadMarks run-time DSM library to take advantage of the analysis. We used six Fortran programs to assess the performance benefits: Jacobi, 3D-FFT, Integer Sort (IS), Shallow, Gauss, and Modified Gram-Schmidt, each with two different data set sizes. The experiments were run on an 8-node IBM SP/2 using user-space communication. Compiler optimization in conjunction with the augmented run-time system achieves substantial execution time improvements over the base TreadMarks, ranging from 4% to 59% on 8 processors. Relative to message-passing implementations of the same applications, the integrated compile-time/run-time system is 0-29% slower, while the base run-time system is 5-212% slower. For the five programs that XHPF could parallelize (all except IS), the execution times achieved by the compiler-optimized shared memory programs are within 9% of XHPF.
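
The communication aggregation mentioned in the abstract can be illustrated with a minimal sketch. It is not the TreadMarks or Parascope interface: the function names, the 4096-byte page size, and the page-to-owner table are all assumptions made for illustration. The sketch only contrasts two request patterns: without a hint, each faulting page triggers its own request, whereas with a compiler-supplied description of the section a loop will access, the run-time can issue one request per remote processor.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes, for illustration only


def pages_in_section(start, length, page_size=PAGE_SIZE):
    """Pages touched by the byte range [start, start + length)."""
    if length <= 0:
        return []
    first = start // page_size
    last = (start + length - 1) // page_size
    return list(range(first, last + 1))


def requests_on_demand(invalid_pages, owner_of):
    """No hint: each access fault sends its own request, one page at a time."""
    return [(owner_of[page], [page]) for page in invalid_pages]


def requests_aggregated(invalid_pages, owner_of):
    """With a compiler-supplied section: one request per remote processor."""
    by_owner = defaultdict(list)
    for page in invalid_pages:
        by_owner[owner_of[page]].append(page)
    return sorted(by_owner.items())


# Hypothetical scenario: a loop reads 4 pages, held by processors 1 and 2.
section = pages_in_section(start=0, length=4 * PAGE_SIZE)
owner_of = {0: 1, 1: 1, 2: 2, 3: 2}
print(len(requests_on_demand(section, owner_of)))   # 4 messages
print(len(requests_aggregated(section, owner_of)))  # 2 messages
```

In this toy scenario the hint halves the number of request messages. The actual savings in the paper's system depend on the application's access pattern and on the other optimizations listed in the abstract, which this sketch does not model.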