Data and Computation Transformations for Brook Streaming Applications on Multiprocessors
Proceedings of the International Symposium on Code Generation and Optimization
Accelerator: using data parallelism to program GPUs for general-purpose uses
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Compilation for explicitly managed memory hierarchies
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Merge: a programming model for heterogeneous multi-core systems
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
CUBA: an architecture for efficient CPU/co-processor data communication
Proceedings of the 22nd annual international conference on Supercomputing
Harmony: an execution model and runtime for heterogeneous many core systems
Proceedings of the 17th international symposium on High performance distributed computing (HPDC '08)
Programming model for a heterogeneous x86 platform
Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
A data-locality aware mapping and scheduling framework for data-intensive computing
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Learning to rank with (a lot of) word features
Information Retrieval
Least squares quantization in PCM
IEEE Transactions on Information Theory
MDR: performance model driven runtime for heterogeneous parallel platforms
Proceedings of the international conference on Supercomputing
Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework
Proceedings of the 20th international symposium on High performance distributed computing
A virtual memory based runtime to support multi-tenancy in clusters with GPUs
Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Apricot: an optimizing compiler and productivity tool for x86-compatible many-core coprocessors
Proceedings of the 26th ACM international conference on Supercomputing
Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes
Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID '12)
A compiler-assisted runtime-prefetching scheme for heterogeneous platforms
Proceedings of the 8th international conference on OpenMP in a Heterogeneous World (IWOMP '12)
Scheduling concurrent applications on a cluster of CPU-GPU nodes
Future Generation Computer Systems
In this paper, we describe a runtime that automatically enhances the performance of applications running on heterogeneous platforms consisting of a multi-core CPU and a throughput-oriented many-core GPU. The CPU and GPU are connected by a non-coherent interconnect such as PCI-E and therefore do not share memory. Heterogeneous platforms available today, such as [9], are of this type. Our goal is to let the programmer use such a system seamlessly, without rewriting the application and with minimal knowledge of the underlying architectural details. Assuming that applications perform function calls to computational kernels with both CPU and GPU implementations available, our runtime achieves this goal by automatically scheduling the kernels and managing data placement. In particular, it intercepts calls to well-known computational kernels and schedules each call on the CPU or the GPU based on the size and location of its arguments. To improve performance, it defers all data transfers between the CPU and the GPU until they become necessary. By managing data placement transparently to the programmer, it provides a unified memory view despite the underlying separate memory sub-systems. We experimentally evaluate our runtime on a heterogeneous platform consisting of a 2.5 GHz quad-core Xeon CPU and an NVIDIA C870 GPU. Given array sorting, parallel reduction, dense and sparse matrix operations, and ranking as computational kernels, we use our runtime to automatically retarget SSI [25], K-means [32], and two synthetic applications to the above platform with no code changes. We find that, in most cases, performance improves if the computation is moved to the data rather than vice versa. For instance, even if a particular invocation of a kernel is slower on the GPU than on the CPU, the overall application may be faster if the kernel is scheduled on the GPU anyway, especially when the kernel's data already resides in GPU memory as a result of earlier scheduling decisions. Our results show that data-aware CPU/GPU scheduling improves performance by up to 25% over the best data-agnostic scheduling on the same platform.
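The scheduling policy described above can be sketched compactly. The following C++ fragment is a minimal illustrative reconstruction, not the paper's actual implementation; the names Buffer, Device, choose_device, ensure_on, and schedule_kernel are all hypothetical. It tracks where each kernel argument currently resides, schedules a call on the device that already holds the majority of its argument bytes, and defers transfers until a kernel actually needs its data on the other device.

#include <cstddef>
#include <vector>

// Hypothetical types modeling the runtime's view of data placement.
enum class Device { CPU, GPU };

struct Buffer {
    std::size_t size;  // argument size in bytes
    Device location;   // where the current copy of the data resides
};

// Data-aware placement: pick the device that already holds at least
// half of the argument bytes, so computation moves to the data.
Device choose_device(const std::vector<Buffer*>& args) {
    std::size_t on_gpu = 0, total = 0;
    for (const Buffer* b : args) {
        total += b->size;
        if (b->location == Device::GPU) on_gpu += b->size;
    }
    return (total > 0 && 2 * on_gpu >= total) ? Device::GPU : Device::CPU;
}

// Deferred data movement: copy a buffer only when the chosen device
// differs from its current location.
void ensure_on(Buffer& b, Device target) {
    if (b.location != target) {
        // transfer over PCI-E here, e.g. via cudaMemcpy in a real runtime
        b.location = target;
    }
}

// Intercepted kernel call: choose a device, migrate arguments lazily,
// then dispatch to the CPU or GPU implementation of the kernel.
template <typename CpuFn, typename GpuFn>
void schedule_kernel(std::vector<Buffer*> args, CpuFn run_cpu, GpuFn run_gpu) {
    Device d = choose_device(args);
    for (Buffer* b : args) ensure_on(*b, d);
    if (d == Device::GPU) run_gpu(); else run_cpu();
}

Under this policy, a kernel whose inputs already live in GPU memory is dispatched to the GPU even if its standalone GPU time is worse, reflecting the paper's observation that avoiding PCI-E transfers can outweigh per-kernel speed differences.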