Portable programs for parallel processors
Portable programs for parallel processors
Data optimization: allocation of arrays to reduce communication on SIMD machines
Journal of Parallel and Distributed Computing - Massively parallel computation
NUMA policies and their relation to memory architecture
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
The robustness of NUMA memory management
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
The Stanford Dash Multiprocessor
Computer
SPLASH: Stanford parallel applications for shared-memory
ACM SIGARCH Computer Architecture News
The DASH prototype: implementation and performance
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
APRIL: a processor architecture for multiprocessing
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Monitors: an operating system structuring concept
Communications of the ACM
An Overview of the Fortran D Programming System
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Improving Processor and Cache Locality in Fine-Grain Parallel Computations using Object-Affinity Scheduling and Continuation Passing
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
CRL: high-performance all-software distributed shared memory
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Communication optimizations for parallel computing using data access information
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Commutativity analysis: a new analysis technique for parallelizing compilers
ACM Transactions on Programming Languages and Systems (TOPLAS)
The design, implementation, and evaluation of Jade
ACM Transactions on Programming Languages and Systems (TOPLAS)
Scheduling threads for low space requirement and good locality
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
The data locality of work stealing
Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures
Cacheminer: A Runtime Approach to Exploit Cache Locality on SMPs
IEEE Transactions on Parallel and Distributed Systems
Pointer and escape analysis for multithreaded programs
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Supporting dynamic data structures with Olden
Compiler optimizations for scalable parallel systems
A hierarchical load-balancing framework for dynamic multithreaded computations
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Pthreads for dynamic and irregular parallelism
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Affinity scheduling of unbalanced workloads
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Analysis of Several Scheduling Algorithms under the Nano-Thread Programming Model
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Fiia: user-centered development of adaptive groupware systems
Proceedings of the 1st ACM SIGCHI symposium on Engineering interactive computing systems
Large-scale shared memory multiprocessors typically support a multilevel memory hierarchy consisting of per-processor caches, a local portion of shared memory, and remote shared memory. On such machines, the performance of parallel programs is often limited by the high latency of remote memory references. In this paper we explore how knowledge of the underlying memory hierarchy can be used to schedule computation and distribute data structures, and thereby improve data locality. Our study is done in the context of COOL, a concurrent object-oriented language developed at Stanford. We develop abstractions for the programmer to supply optional information about the data reference patterns of the program. This information is used by the runtime system to distribute tasks and objects so that the tasks execute close (in the memory hierarchy) to the objects they reference. We demonstrate the effectiveness of these techniques by applying them to several applications chosen from the SPLASH parallel benchmark suite. Our experience suggests that data locality can be improved simply, through a combination of programmer abstractions and smart runtime scheduling.
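The abstract describes the core mechanism at a high level: the programmer supplies optional affinity hints about which objects a task mostly references, and the runtime places each task on the processor whose local memory holds that object. The toy sketch below illustrates that idea in Python; it is not the COOL runtime or its API — `AffinityScheduler`, `distribute`, and `spawn` are hypothetical names, and real implementations would add per-processor worker threads and work stealing for load balance.

```python
from collections import deque

class AffinityScheduler:
    """Toy per-processor task queues with object-affinity placement.

    A hypothetical sketch of affinity scheduling, not the COOL runtime:
    tasks carry an optional hint naming the object they mostly touch,
    and the scheduler enqueues each task on the processor that "owns"
    that object, so most of its references stay local in the hierarchy.
    """

    def __init__(self, num_procs):
        self.queues = [deque() for _ in range(num_procs)]
        self.home = {}  # id(object) -> processor holding it locally

    def distribute(self, obj, proc):
        # Programmer hint: obj lives in proc's local memory.
        self.home[id(obj)] = proc

    def spawn(self, fn, affinity=None):
        # Place the task next to its affinity object's home processor;
        # with no hint, fall back to processor 0.
        proc = self.home.get(id(affinity), 0)
        self.queues[proc].append(fn)
        return proc

# Usage: distribute a data block to processor 2, then spawn a task
# with affinity to that block; the task is queued on processor 2.
sched = AffinityScheduler(num_procs=4)
block = [0.0] * 1024  # one block of a distributed parallel array
sched.distribute(block, proc=2)
where = sched.spawn(lambda: sum(block), affinity=block)
```

A real runtime would also co-locate the task's child tasks and migrate objects whose home no longer matches their dominant referencer; the sketch shows only the placement decision itself.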