Merge: a programming model for heterogeneous multi-core systems

Authors:
Michael D. Linderman;Jamison D. Collins;Hong Wang;Teresa H. Meng
Affiliations:
Stanford University, Stanford, CA;Intel Corporation, Santa Clara, CA;Intel Corporation, Santa Clara, CA;Stanford University, Stanford, CA
Venue:
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Year:
2008

Abstract

In this paper we propose the Merge framework, a general-purpose programming model for heterogeneous multi-core systems. The Merge framework replaces current ad hoc approaches to parallel programming on heterogeneous platforms with a rigorous, library-based methodology that can automatically distribute computation across heterogeneous cores to achieve increased energy and performance efficiency. The Merge framework provides (1) a predicate dispatch-based library system for managing and invoking function variants for multiple architectures; (2) a high-level, library-oriented parallel language based on map-reduce; and (3) a compiler and runtime that implement the map-reduce language pattern by dynamically selecting the best available function implementations for a given input and machine configuration. Using a generic sequencer architecture interface for heterogeneous accelerators, the Merge framework can integrate function variants for specialized accelerators, offering the potential for to-the-metal performance on a wide range of heterogeneous architectures, all transparent to the user. The Merge framework has been prototyped on two systems: a heterogeneous platform consisting of an Intel Core 2 Duo CPU and an 8-core, 32-thread Intel Graphics and Media Accelerator X3000, and a homogeneous 32-way Unisys SMP system with Intel Xeon processors. We implemented a set of benchmarks using the Merge framework and enhanced the library with X3000-specific implementations, achieving speedups of 3.6x to 8.5x using the X3000 and 5.2x to 22x using the 32-way system, relative to the straight C reference implementation on a single IA32 core.
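
The idea of predicate dispatch over function variants, combined with a map-reduce pattern, can be illustrated with a small sketch. This is not the Merge API or its implementation: the names `variant`, `dispatch`, `map_reduce`, the `machine` dictionary, the `"accel"` target label, and the size threshold are all hypothetical, and the "accelerator" variant is ordinary Python standing in for device-specific code. The sketch only shows how a runtime might pick, per chunk of input, the highest-priority variant whose predicate accepts both the input and the machine configuration.

```python
from functools import reduce

# Hypothetical registry: function name -> list of (priority, predicate, implementation).
VARIANTS = {}
# Records which variant handled each call, so the selection is observable.
CALLS = []


def variant(name, predicate, priority=0):
    """Register an implementation of `name`, valid when predicate(args, machine) is true."""
    def register(impl):
        VARIANTS.setdefault(name, []).append((priority, predicate, impl))
        return impl
    return register


def dispatch(name, machine, *args):
    """Call the highest-priority variant whose predicate accepts this input and machine."""
    for _, predicate, impl in sorted(VARIANTS[name], key=lambda v: -v[0]):
        if predicate(args, machine):
            return impl(*args)
    raise LookupError(f"no applicable variant for {name!r}")


@variant("sum_squares", lambda args, machine: True)
def sum_squares_generic(chunk):
    # Fallback variant: always applicable, plain sequential code.
    CALLS.append("generic")
    return sum(x * x for x in chunk)


@variant(
    "sum_squares",
    # Assumed predicate: an accelerator is present and the chunk is large enough
    # to be worth offloading (the threshold of 4 is arbitrary).
    lambda args, machine: "accel" in machine["targets"] and len(args[0]) >= 4,
    priority=1,
)
def sum_squares_accel(chunk):
    # Stand-in for an accelerator-specific variant; real code would run on the device.
    CALLS.append("accel")
    return reduce(lambda acc, x: acc + x * x, chunk, 0)


def map_reduce(name, machine, data, chunk_size, combine):
    """Split non-empty data into chunks, dispatch each chunk, then combine the partial results."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    partials = [dispatch(name, machine, chunk) for chunk in chunks]
    return reduce(combine, partials)
```

With `machine = {"targets": ["cpu", "accel"]}`, a call such as `map_reduce("sum_squares", machine, list(range(10)), 4, lambda a, b: a + b)` sends the two full chunks to the accelerator stand-in and the short final chunk to the generic fallback. The caller sees the same result either way, which is the sense in which variant selection stays transparent to the user.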