Optimizing dataflow applications on heterogeneous environments

Authors:
George Teodoro;Timothy D. Hartley;Umit V. Catalyurek;Renato Ferreira
Affiliations:
Dept. of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil;Depts. of Biomedical Informatics, and Electrical & Computer Engineering, The Ohio State University, Columbus, USA;Depts. of Biomedical Informatics, and Electrical & Computer Engineering, The Ohio State University, Columbus, USA;Dept. of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Venue:
Cluster Computing
Year:
2012

Abstract

The increases in multi-core processor parallelism and in the flexibility of many-core accelerator processors, such as GPUs, have turned traditional SMP systems into hierarchical, heterogeneous computing environments. Fully exploiting these improvements in parallel system design remains an open problem. Moreover, most current tools for developing parallel applications on hierarchical systems concentrate on a single processor type (e.g., accelerators) and do not coordinate several heterogeneous processors. Here, we show that making use of all of the heterogeneous computing resources can significantly improve application performance. Our approach, which consists of optimizing applications at run-time by efficiently coordinating application task execution on all available processing units, is evaluated in the context of replicated dataflow applications. The proposed techniques were developed and implemented in an integrated run-time system targeting both intra- and inter-node parallelism. The experimental results with a real-world complex biomedical application show that our approach nearly doubles the performance of the GPU-only implementation on a distributed heterogeneous accelerator cluster.
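
The claim that coordinating all processing units can beat a GPU-only run can be illustrated with a toy model. The sketch below is not the paper's run-time system: it assumes a simple demand-driven policy in which whichever device becomes idle first takes the next task. The function name `simulate`, the device names, and the speeds (a GPU four times faster than a CPU core) are hypothetical values chosen for illustration, not measurements from the paper.

```python
import heapq


def simulate(task_costs, speeds):
    """Demand-driven scheduling sketch (illustrative assumption, not the paper's method).

    task_costs: list of work units, one entry per task.
    speeds: dict mapping a hypothetical device name to work units per second.
    Returns (makespan, number of tasks executed per device).
    """
    # Min-heap of (time at which the device becomes idle, device name).
    idle_at = [(0.0, name) for name in sorted(speeds)]
    heapq.heapify(idle_at)
    counts = {name: 0 for name in speeds}
    makespan = 0.0
    for cost in task_costs:
        # The device that becomes idle first requests the next task.
        start, name = heapq.heappop(idle_at)
        finish = start + cost / speeds[name]
        counts[name] += 1
        makespan = max(makespan, finish)
        heapq.heappush(idle_at, (finish, name))
    return makespan, counts


tasks = [1.0] * 100  # 100 equal-sized tasks (made-up workload)
gpu_only, _ = simulate(tasks, {"gpu": 4.0})
combined, counts = simulate(tasks, {"gpu": 4.0, "cpu": 1.0})
print(f"GPU only: {gpu_only:.2f} s, GPU + CPU: {combined:.2f} s, tasks per device: {counts}")
```

In this toy setting the slower device still shortens the total run, because it absorbs a share of the tasks proportional to its speed instead of sitting idle. The size of the gain depends entirely on the assumed speeds and task sizes, so it says nothing about the speedups reported in the paper.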