DAGuE: A generic distributed DAG engine for High Performance Computing

Authors:
George Bosilca;Aurelien Bouteiller;Anthony Danalis;Thomas Herault;Pierre Lemarinier;Jack Dongarra
Affiliations:
Innovative Computing Laboratory, The University of Tennessee, United States;Innovative Computing Laboratory, The University of Tennessee, United States;Innovative Computing Laboratory, The University of Tennessee, United States;Innovative Computing Laboratory, The University of Tennessee, United States;IRISA, Université de Rennes 1, France;Innovative Computing Laboratory, The University of Tennessee, United States and Oak Ridge National Laboratory, United States
Venue:
Parallel Computing
Year:
2012



Abstract

The frenetic development of current architectures places a strain on state-of-the-art programming environments. Harnessing the full potential of such architectures is a tremendous task for the whole scientific computing community. We present DAGuE, a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. The applications we consider can be expressed as a Directed Acyclic Graph (DAG) of tasks with labeled edges designating data dependencies. DAGs are represented in a compact, problem-size-independent format that can be queried on demand to discover data dependencies in a fully distributed fashion. DAGuE assigns computation threads to the cores, overlaps communications and computations, and uses a dynamic, fully distributed scheduler based on cache awareness, data locality, and task priority. We demonstrate the efficiency of our approach using several micro-benchmarks to analyze the performance of different components of the framework, and a linear algebra factorization as a use case.
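To make the idea of a compact, problem-size-independent DAG more concrete, the following is a minimal illustrative sketch in Python. It is not DAGuE's actual input format or API. It assumes a tiled Cholesky factorization as the example (the abstract does not name the factorization used), and the names `successors`, `all_tasks`, and `run_order` are hypothetical. The point it shows is that the graph is never stored: the successors of any task are computed on demand from the task's indices and the tile count `nt`.

```python
from collections import deque


def successors(task, nt):
    """Compute, on demand, the tasks that consume the output of `task`.

    Tasks are tuples such as ("POTRF", k) or ("GEMM", m, n, k) for a
    right-looking tiled Cholesky on an nt x nt tile grid (an assumed example).
    """
    kind, *idx = task
    if kind == "POTRF":
        (k,) = idx
        return [("TRSM", m, k) for m in range(k + 1, nt)]
    if kind == "TRSM":
        m, k = idx
        out = [("SYRK", m, k)]
        out += [("GEMM", m, n, k) for n in range(k + 1, m)]
        out += [("GEMM", p, m, k) for p in range(m + 1, nt)]
        return out
    if kind == "SYRK":
        m, k = idx
        return [("POTRF", m)] if k + 1 == m else [("SYRK", m, k + 1)]
    if kind == "GEMM":
        m, n, k = idx
        return [("TRSM", m, n)] if k + 1 == n else [("GEMM", m, n, k + 1)]
    raise ValueError(f"unknown task kind: {kind}")


def all_tasks(nt):
    """Enumerate every task; used here only to check the sketch."""
    for k in range(nt):
        yield ("POTRF", k)
        for m in range(k + 1, nt):
            yield ("TRSM", m, k)
            yield ("SYRK", m, k)
            for n in range(k + 1, m):
                yield ("GEMM", m, n, k)


def run_order(nt):
    """Return one valid execution order by releasing tasks as inputs arrive."""
    tasks = list(all_tasks(nt))
    pending = {t: 0 for t in tasks}  # number of unsatisfied input dependencies
    for t in tasks:
        for s in successors(t, nt):
            pending[s] += 1
    ready = deque(t for t in tasks if pending[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for s in successors(t, nt):
            pending[s] -= 1
            if pending[s] == 0:
                ready.append(s)
    return order
```

In this sketch the description of the graph has constant size regardless of `nt`, while the number of tasks grows cubically with it. A distributed runtime could, in principle, evaluate a function like `successors` locally on each node instead of enumerating all tasks up front as `run_order` does here for checking purposes.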