Data parallel programming has been widely used to develop scientific applications on various types of parallel machines: SIMD machines, MIMD distributed-memory machines, and UMA shared-memory machines. On NUMA shared-memory machines, data locality is the key to good performance of parallel applications. In this paper, we propose a set of macros (NUMACROS) for data parallel programming on NUMA machines. NUMACROS aims to achieve both ease of programming and the data locality needed for performance. Programs written using NUMACROS are nearly as short and readable as sequential versions of the same programs. To obtain data locality, data and loops are distributed and partitioned among the processors in a coordinated fashion. Although a global address space facilitates data distribution on NUMA systems, a naive implementation of an application will suffer from high overheads. To reduce these costs, a number of approaches have been proposed and evaluated, including index precomputing, index checking, and loop transformation. Our experimental results on the Hector multiprocessor show that these approaches are effective. While such facilities will be provided by compilers in the long run, NUMACROS is a helpful interim step.