Maestro: a self-organizing peer-to-peer dataflow framework using reinforcement learning

Authors:
C. van Reeuwijk
Affiliations:
Vrije Universiteit Amsterdam, Amsterdam, Netherlands
Venue:
Proceedings of the 18th ACM international symposium on High performance distributed computing
Year:
2009

Citing 18
Cited 2

Technical Note: Q-Learning

Machine Learning
Efficient load balancing for wide-area divide-and-conquer applications

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
A History of Data-Flow Languages

IEEE Annals of the History of Computing
Bandwidth-Centric Allocation of Independent Tasks on Heterogeneous Platforms

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Dataflow Java: Implicitly Parallel Java

ACAC '00 Proceedings of the 5th Australasian Computer Architecture Conference
GridFlow: Workflow Management for Grid Computing

CCGRID '03 Proceedings of the 3rd International Symposium on Cluster Computing and the Grid
Grid Economy Comes of Age: Emerging Gridbus Tools for Service-Oriented Cluster and Grid Computing

P2P '02 Proceedings of the Second International Conference on Peer-to-Peer Computing
Symphony - A Java-Based Composition and Manipulation Framework for Computational Grids

CCGRID '02 Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid
BOINC: A System for Public-Resource Computing and Storage

GRID '04 Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing
Distributed computing in practice: the Condor experience

Concurrency and Computation: Practice & Experience - Grid Performance
Lowering the barriers to programming: A taxonomy of programming environments and languages for novice programmers

ACM Computing Surveys (CSUR)
Taverna: a tool for the composition and enactment of bioinformatics workflows

Bioinformatics
Application-specific scheduling for the organic grid

CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
A comprehensive review of nature inspired routing algorithms for fixed telecommunication networks

Journal of Systems Architecture: the EUROMICRO Journal - Special issue: Nature-inspired applications and systems
StreamFlex: high-throughput stream programming in Java

Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications
User-friendly and reliable grid computing based on imperfect middleware

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Resource tracking in parallel and distributed applications

HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Developing Java grid applications with Ibis

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing

Towards jungle computing with Ibis/Constellation

Proceedings of the 2011 workshop on Dynamic distributed data-intensive applications, programming abstractions, and systems
Chapter 12: panta rhei: flexible execution engine for search computing queries

Search Computing

Abstract

In this paper we describe Maestro, a dataflow computation framework for Ibis, our Java-based grid middleware. The novelty of Maestro is that it is a self-organizing peer-to-peer system: it distributes the tasks in a flow over the available nodes based on local decisions made on each node, without any central coordination. As a result, the computations are more scalable, more resilient against failing nodes, and less sensitive to communication latencies. Maestro uses a task distribution approach based on reinforcement learning, a learning mechanism in which the positive outcome of a choice makes it more likely that the same choice is repeated in the future. Maestro selects the most efficient node for each stage in the computation based on the observed computation and communication times. To ensure agility, the selection decisions are made as late as possible without letting the nodes fall idle. Using this task distribution algorithm, the nodes can be used efficiently, even in a heterogeneous system with failure-prone nodes communicating through high-latency connections.
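
The selection idea in the abstract can be illustrated with a small sketch. This is not Maestro's actual algorithm or API: the class names, the exponential-moving-average update, the learning rate, and the optimistic zero starting estimate are all assumptions made for illustration. Each node keeps its own table of observed computation and communication times per peer, sends the next task of a stage to the peer with the lowest estimated total time, and updates that estimate when the result comes back, so that a good outcome makes the same choice more likely next time.

```python
from dataclasses import dataclass, field


@dataclass
class NodeEstimate:
    """Running estimate (seconds) for one peer on one stage. Hypothetical structure."""
    compute_time: float = 0.0   # optimistic start, so untried peers get sampled once
    transfer_time: float = 0.0

    def total(self) -> float:
        return self.compute_time + self.transfer_time


@dataclass
class StageSelector:
    """Local, per-node selector for a single stage (illustrative, not Maestro's code)."""
    learning_rate: float = 0.5  # assumed value; weight given to the newest observation
    estimates: dict = field(default_factory=dict)

    def add_node(self, node: str) -> None:
        # A peer joined: start tracking it with an optimistic estimate.
        self.estimates.setdefault(node, NodeEstimate())

    def remove_node(self, node: str) -> None:
        # A peer failed or left: stop considering it.
        self.estimates.pop(node, None)

    def select(self) -> str:
        # Pick the peer with the lowest estimated computation + communication time.
        if not self.estimates:
            raise LookupError("no nodes available for this stage")
        return min(self.estimates, key=lambda n: self.estimates[n].total())

    def report(self, node: str, compute_time: float, transfer_time: float) -> None:
        # Move the estimate toward the observed times (exponential moving average).
        est = self.estimates.get(node)
        if est is None:
            return  # the peer was removed while the task was in flight
        a = self.learning_rate
        est.compute_time += a * (compute_time - est.compute_time)
        est.transfer_time += a * (transfer_time - est.transfer_time)


if __name__ == "__main__":
    selector = StageSelector()
    selector.add_node("fast")
    selector.add_node("slow")
    for _ in range(4):
        node = selector.select()
        # Made-up observations: "fast" takes 1.2 s in total, "slow" takes 5.0 s.
        observed = (1.0, 0.2) if node == "fast" else (4.0, 1.0)
        selector.report(node, *observed)
        print(node, round(selector.estimates[node].total(), 3))
```

Because every node holds its own selector, no central coordinator is involved, and removing a failed peer only touches local state. A fuller treatment would also account for queue lengths and for deferring the decision until a worker is about to fall idle, which this sketch leaves out.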