Portable programs for parallel processors
Portable programs for parallel processors
Data optimization: allocation of arrays to reduce communication on SIMD machines
Journal of Parallel and Distributed Computing - Massively parallel computation
NUMA policies and their relation to memory architecture
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
The robustness of NUMA memory management
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
The Stanford Dash Multiprocessor
Computer
SPLASH: Stanford parallel applications for shared-memory
ACM SIGARCH Computer Architecture News
The DASH prototype: implementation and performance
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
APRIL: a processor architecture for multiprocessing
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Monitors: an operating system structuring concept
Communications of the ACM
An Overview of the Fortran D Programming System
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Improving Processor and Cache Locality in Fine-Grain Parallel Computations using Object-Affinity Scheduling and Continuation Passing
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
CRL: high-performance all-software distributed shared memory
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Communication optimizations for parallel computing using data access information
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Commutativity analysis: a new analysis technique for parallelizing compilers
ACM Transactions on Programming Languages and Systems (TOPLAS)
The design, implementation, and evaluation of Jade
ACM Transactions on Programming Languages and Systems (TOPLAS)
Scheduling threads for low space requirement and good locality
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
The data locality of work stealing
Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures
Cacheminer: A Runtime Approach to Exploit Cache Locality on SMPs
IEEE Transactions on Parallel and Distributed Systems
Pointer and escape analysis for multithreaded programs
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Supporting dynamic data structures with Olden
Compiler optimizations for scalable parallel systems
A hierarchical load-balancing framework for dynamic multithreaded computations
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Pthreads for dynamic and irregular parallelism
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Affinity scheduling of unbalanced workloads
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Analysis of Several Scheduling Algorithms under the Nano-Thread Programming Model
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Fiia: user-centered development of adaptive groupware systems
Proceedings of the 1st ACM SIGCHI symposium on Engineering interactive computing systems
Large-scale shared memory multiprocessors typically support a multilevel memory hierarchy consisting of per-processor caches, a local portion of shared memory, and remote shared memory. On such machines, the performance of parallel programs is often limited by the high latency of remote memory references. In this paper we explore how knowledge of the underlying memory hierarchy can be used to schedule computation and distribute data structures, and thereby improve data locality. Our study is done in the context of COOL, a concurrent object-oriented language developed at Stanford. We develop abstractions for the programmer to supply optional information about the data reference patterns of the program. This information is used by the runtime system to distribute tasks and objects so that the tasks execute close (in the memory hierarchy) to the objects they reference. We demonstrate the effectiveness of these techniques by applying them to several applications chosen from the SPLASH parallel benchmark suite. Our experience suggests that data locality can be improved simply, through a combination of programmer abstractions and smart runtime scheduling.
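The abstract describes the core mechanism at a high level: the programmer supplies optional affinity hints about which objects a task mostly references, and the runtime places each task on the processor whose local memory holds that object. The toy sketch below illustrates that idea in Python; it is not the COOL runtime or its API — `AffinityScheduler`, `distribute`, and `spawn` are hypothetical names, and real implementations would add per-processor worker threads and work stealing for load balance.

```python
from collections import deque

class AffinityScheduler:
    """Toy per-processor task queues with object-affinity placement.

    A hypothetical sketch of affinity scheduling, not the COOL runtime:
    tasks carry an optional hint naming the object they mostly touch,
    and the scheduler enqueues each task on the processor that "owns"
    that object, so most of its references stay local in the hierarchy.
    """

    def __init__(self, num_procs):
        self.queues = [deque() for _ in range(num_procs)]
        self.home = {}  # id(object) -> processor holding it locally

    def distribute(self, obj, proc):
        # Programmer hint: obj lives in proc's local memory.
        self.home[id(obj)] = proc

    def spawn(self, fn, affinity=None):
        # Place the task next to its affinity object's home processor;
        # with no hint, fall back to processor 0.
        proc = self.home.get(id(affinity), 0)
        self.queues[proc].append(fn)
        return proc

# Usage: distribute a data block to processor 2, then spawn a task
# with affinity to that block; the task is queued on processor 2.
sched = AffinityScheduler(num_procs=4)
block = [0.0] * 1024  # one block of a distributed parallel array
sched.distribute(block, proc=2)
where = sched.spawn(lambda: sum(block), affinity=block)
```

A real runtime would also co-locate the task's child tasks and migrate objects whose home no longer matches their dominant referencer; the sketch shows only the placement decision itself.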