Cache-coherent multiprocessors with distributed shared memory are becoming increasingly popular for parallel computing. However, obtaining high performance on these machines requires that an application execute with good data locality. In addition to making effective use of caches, it is often necessary to distribute data structures across the local memories of the processing nodes, thereby reducing the latency of cache misses. We have designed a set of abstractions for performing data distribution in the context of explicitly parallel programs and implemented them within the SGI MIPSpro compiler system. Our system incorporates many unique features to enhance both programmability and performance. We address the former by providing a very simple programming model with extensive support for error detection. Regarding performance, we carefully design the user abstractions with the underlying compiler optimizations in mind, we incorporate several optimization techniques to generate efficient code for accessing distributed data, and we provide a tight integration of these techniques with other optimizations within the compiler. Our initial experience suggests that the directives are easy to use and can yield substantial performance gains, in some cases by as much as a factor of 3 over the same codes without distribution.
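To make the idea of distributing a data structure across node-local memories concrete, the following is a minimal sketch of a one-dimensional block distribution. It is illustrative only and is not the directive syntax or the code generated by the system described above; the function names (`block_size`, `owner`, `local_range`, `count_remote`) and the three-point stencil used as a workload are assumptions introduced for this example.

```python
def block_size(n, p):
    """Elements per node when n elements are block-distributed over p nodes
    (ceiling division, so the last node may hold fewer elements)."""
    return -(-n // p)


def owner(i, n, p):
    """Node whose local memory holds element i (0-based) under the
    hypothetical block distribution."""
    return i // block_size(n, p)


def local_range(node, n, p):
    """Half-open index range [lo, hi) stored in the given node's memory."""
    b = block_size(n, p)
    lo = min(node * b, n)
    hi = min(lo + b, n)
    return lo, hi


def count_remote(n, p):
    """Count remote neighbor reads for a three-point stencil, assuming each
    iteration i runs on the node that owns element i."""
    remote = 0
    for i in range(1, n - 1):
        me = owner(i, n, p)
        for j in (i - 1, i + 1):
            if owner(j, n, p) != me:
                remote += 1
    return remote


if __name__ == "__main__":
    n, p = 10, 4
    for node in range(p):
        print("node", node, "holds indices", local_range(node, n, p))
    print("remote neighbor reads:", count_remote(n, p))
```

Under these assumptions, only accesses that cross a block boundary touch another node's memory, which is the effect a data distribution aims for: most cache misses are served from local memory rather than from a remote node.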