An implementation of parallel file distribution in an agent hierarchy

Authors:
Munehiro Fukuda; Jumpei Miyauchi
Affiliations:
Computing & Software Systems, University of Washington, Bothell, USA 98011; Computer Science, Ehime University, Matsuyama, Japan 790-8577
Venue:
The Journal of Supercomputing
Year:
2009

Abstract

A PC grid is a cost-effective grid-computing platform that attracts users by allocating as many desktop computers as their massively parallel applications request. The challenge is how to distribute the necessary files to remote computing nodes that may not be connected to the same network file system, may lack the disk space to hold entire files, and may even be powered off asynchronously.

Targeting PC grids, the AgentTeamwork grid-computing middleware deploys a hierarchy of mobile agents to remote desktops to launch, monitor, checkpoint, and resume a parallel and distributed computing job. To achieve high-speed file distribution, AgentTeamwork takes advantage of this agent hierarchy. The system partitions files into stripes at the tree root if they are random-access files, duplicates them at each tree level if they are shared among all remote nodes, fragments them into smaller messages if they are too large to relay to a lower tree level, aggregates such messages into a larger fragment if they are in transit to the same subtree, and returns output files to the user along multiple paths established within the tree. To achieve fault-tolerant file delivery, each agent periodically takes a snapshot of its in-transit and in-memory file messages together with its user job, and resumes them from the latest snapshot after an accidental crash.

This paper presents an implementation of AgentTeamwork's file-distribution algorithm, including file partitioning, transfer, checkpointing, and consistency maintenance, and reports its competitive performance.
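The partitioning, fragmentation, and aggregation steps named in the abstract can be illustrated with a minimal Python sketch. It is not AgentTeamwork's code: the function names (stripe, fragment, aggregate, reassemble), the message layout, and the mapping from child agents to rank ranges are all hypothetical, chosen only to show how a file might be striped at the root, cut into relay-sized messages, and bundled per subtree.

```python
from typing import Dict, List, Tuple

# Hypothetical message layout: (destination rank, offset within stripe, payload).
Message = Tuple[int, int, bytes]


def stripe(data: bytes, num_nodes: int) -> List[bytes]:
    """Split data into num_nodes contiguous stripes of near-equal size."""
    base, extra = divmod(len(data), num_nodes)
    stripes, start = [], 0
    for rank in range(num_nodes):
        # The first `extra` ranks receive one additional byte.
        size = base + (1 if rank < extra else 0)
        stripes.append(data[start:start + size])
        start += size
    return stripes


def fragment(dest: int, payload: bytes, max_size: int) -> List[Message]:
    """Cut one stripe into messages no larger than max_size bytes."""
    return [(dest, off, payload[off:off + max_size])
            for off in range(0, len(payload), max_size)]


def aggregate(messages: List[Message],
              subtrees: Dict[str, range]) -> Dict[str, List[Message]]:
    """Bundle messages by the child subtree that contains their destination."""
    bundles: Dict[str, List[Message]] = {child: [] for child in subtrees}
    for msg in messages:
        for child, ranks in subtrees.items():
            if msg[0] in ranks:
                bundles[child].append(msg)
                break
    return bundles


def reassemble(messages: List[Message], dest: int) -> bytes:
    """Rebuild the stripe for one destination from its messages."""
    parts = sorted((off, chunk) for d, off, chunk in messages if d == dest)
    return b"".join(chunk for _, chunk in parts)
```

In this sketch, a root agent would call stripe once, fragment each stripe, and pass the result of aggregate to its children, each of which repeats the aggregation for its own subtree. Checkpointing and the multi-path return of output files are left out.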