Exploiting replication and data reuse to efficiently schedule data-intensive applications on grids

Authors:
Elizeu Santos-Neto;Walfredo Cirne;Francisco Brasileiro;Aliandro Lima
Affiliations:
Universidade Federal de Campina Grande;Universidade Federal de Campina Grande;Universidade Federal de Campina Grande;Universidade Federal de Campina Grande
Venue:
JSSPP'04 Proceedings of the 10th international conference on Job Scheduling Strategies for Parallel Processing
Year:
2004

Abstract

Data-intensive applications executing over a computational grid demand large data transfers, which are costly operations. Taking these transfers into account is therefore mandatory for scheduling data-intensive applications efficiently on grids. Further, within a heterogeneous and ever-changing environment such as a grid, better schedules are typically attained by heuristics that use dynamic information about the grid and the applications. However, this information is often difficult to obtain accurately. On the other hand, although there are schedulers that attain good performance without requiring dynamic information, they were not designed to take data transfer into account. This paper presents Storage Affinity, a novel scheduling heuristic for bag-of-tasks data-intensive applications running on grid environments. Storage Affinity exploits a data reuse pattern, common in many data-intensive applications, that allows it to take data transfer delays into account and reduce the makespan of the application. Further, it uses a replication strategy that yields efficient schedules without relying upon dynamic information that is difficult to obtain. Our results show that Storage Affinity may attain better performance than state-of-the-art knowledge-dependent schedulers, at the expense of consuming more CPU cycles and network bandwidth.
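
The data-reuse idea can be illustrated with a minimal sketch. The abstract does not define how affinity is measured, so the code below assumes it is the number of bytes of a task's input that a site already stores. The names `storage_affinity`, `schedule`, and the dictionary layout are hypothetical and chosen only for illustration. The replication strategy mentioned in the abstract is not modelled here.

```python
from typing import Dict, List, Set, Tuple


def storage_affinity(task_inputs: Set[str], site_files: Set[str],
                     sizes: Dict[str, int]) -> int:
    """Bytes of the task's input already stored at the site (assumed definition)."""
    return sum(sizes[f] for f in task_inputs & site_files)


def schedule(tasks: Dict[str, Set[str]], sites: Dict[str, Set[str]],
             sizes: Dict[str, int]) -> List[Tuple[str, str, int]]:
    """Greedy sketch: send each task to the site with the highest affinity.

    Returns (task, site, bytes_transferred) tuples. Ties are broken by site
    name. Only static information (file placement and sizes) is used; no
    CPU or network load estimates are consulted.
    """
    # Copy so that the caller's view of each site's storage is not mutated.
    stored = {site: set(files) for site, files in sites.items()}
    plan = []
    for task, inputs in tasks.items():
        best = max(sorted(stored),
                   key=lambda s: storage_affinity(inputs, stored[s], sizes))
        transferred = sum(sizes[f] for f in inputs - stored[best])
        stored[best] |= inputs  # transferred files become reusable by later tasks
        plan.append((task, best, transferred))
    return plan


if __name__ == "__main__":
    sizes = {"a": 100, "b": 50, "c": 10}
    tasks = {"t1": {"a", "b"}, "t2": {"a", "c"}, "t3": {"c"}}
    sites = {"s1": {"a"}, "s2": {"c"}}
    for entry in schedule(tasks, sites, sizes):
        print(entry)
```

In this toy input every task lands on `s1`, because the site that holds the large file `a` keeps accumulating data. A purely affinity-driven assignment can therefore concentrate work on one site, which is one reason a scheduler of this kind would pair it with a second mechanism such as the replication the abstract describes.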