Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping

Authors:
Chi-Keung Luk;Sunpyo Hong;Hyesoon Kim
Affiliations:
Intel Corporation, Hudson, MA;Georgia Institute of Technology, Atlanta, GA;Georgia Institute of Technology, Atlanta, GA
Venue:
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2009



Abstract

Heterogeneous multiprocessors are increasingly important in the multi-core era due to their potential for high performance and energy efficiency. For software to fully realize this potential, the step that maps computations to processing elements must be as automated as possible. However, the state-of-the-art approach relies on the programmer to specify this mapping manually and statically. This approach is not only labor intensive but also unable to adapt to changes in the runtime environment, such as problem sizes and hardware/software configurations. In this study, we propose adaptive mapping, a fully automatic technique to map computations to processing elements on a CPU+GPU machine. We have implemented it in our experimental heterogeneous programming system called Qilin. Our results show that, by judiciously distributing work over the CPU and GPU, automatic adaptive mapping achieves on average a 25% reduction in execution time and a 20% reduction in energy consumption relative to static mappings for a set of important computation benchmarks. We also demonstrate that our technique is able to adapt to changes in the input problem size and system configuration.
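
The abstract does not say how the mapping decision is computed, so the following is only an illustrative sketch of one way a runtime could split work between a CPU and a GPU. It assumes that execution time on each device grows roughly linearly with the amount of work, that the two devices run their shares concurrently, and that a few profiling runs are available to fit each model. The names `LinearModel`, `fit_linear`, and `choose_cpu_fraction` are hypothetical and are not taken from Qilin.

```python
from dataclasses import dataclass


@dataclass
class LinearModel:
    """Hypothetical execution-time model: time(n) = intercept + slope * n."""
    intercept: float
    slope: float

    def time(self, n: float) -> float:
        # A device that receives no work is assumed to cost nothing.
        return 0.0 if n <= 0 else self.intercept + self.slope * n


def fit_linear(sizes, times) -> LinearModel:
    """Least-squares fit of time = intercept + slope * size from profiling runs."""
    k = len(sizes)
    mean_x = sum(sizes) / k
    mean_y = sum(times) / k
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    slope = sxy / sxx
    return LinearModel(mean_y - slope * mean_x, slope)


def choose_cpu_fraction(cpu: LinearModel, gpu: LinearModel, n: float, steps: int = 1000):
    """Return (cpu_fraction, predicted_time) minimizing max(CPU time, GPU time).

    The devices are assumed to run concurrently, so the predicted total
    time is the slower of the two. A simple grid search over the CPU
    share keeps the sketch short.
    """
    best_fraction, best_time = 0.0, float("inf")
    for i in range(steps + 1):
        beta = i / steps
        predicted = max(cpu.time(beta * n), gpu.time((1 - beta) * n))
        if predicted < best_time:
            best_fraction, best_time = beta, predicted
    return best_fraction, best_time


if __name__ == "__main__":
    # Made-up profiling data (problem size, seconds); not measurements from the paper.
    cpu_model = fit_linear([100, 200, 300], [1.0, 2.0, 3.0])
    gpu_model = fit_linear([100, 200, 300], [0.75, 1.0, 1.25])
    for size in (10, 1000):
        fraction, predicted = choose_cpu_fraction(cpu_model, gpu_model, size)
        print(f"n={size}: CPU share={fraction:.2f}, predicted time={predicted:.2f}s")
```

With these made-up timings, the GPU has a fixed start-up cost but a lower per-item cost. The search therefore keeps a small problem entirely on the CPU and gives the GPU most of a large one, which is the kind of input-size adaptation the abstract describes.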