A data placement strategy in scientific cloud workflows

Authors:
Dong Yuan;Yun Yang;Xiao Liu;Jinjun Chen
Affiliations:
Faculty of Information and Communication Technologies, Swinburne University of Technology, Hawthorn, Melbourne 3122, Australia (all authors)
Venue:
Future Generation Computer Systems
Year:
2010

Abstract

In scientific cloud workflows, large amounts of application data need to be stored in distributed data centres. To store these data effectively, a data manager must intelligently select the data centres in which they will reside. This is, however, not the case for data that must have a fixed location. When one task needs several datasets located in different data centres, the movement of large volumes of data becomes a challenge. In this paper, we propose a matrix-based k-means clustering strategy for data placement in scientific cloud workflows. The strategy contains two algorithms: one groups the existing datasets into k data centres during the workflow build-time stage, and the other dynamically clusters newly generated datasets to the most appropriate data centres, based on dependencies, during the runtime stage. Simulations show that the strategy can effectively reduce data movement during the workflow's execution.
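
The abstract gives no algorithmic detail, so the following Python fragment is only an illustrative sketch of the dependency-based idea, not the authors' algorithm. It assumes that the dependency between two datasets is measured as the number of tasks that use both, and that a newly generated dataset is sent to the data centre whose resident datasets share the highest total dependency with it. The function names, task names, dataset names, and data centre names are all hypothetical.

```python
from itertools import combinations


def dependency_matrix(task_inputs, datasets):
    """Build a symmetric dependency matrix as nested dicts.

    Assumption: dep[a][b] is the number of tasks that use both a and b;
    dep[a][a] is the number of tasks that use a.
    """
    dep = {a: {b: 0 for b in datasets} for a in datasets}
    for used in task_inputs.values():
        for d in used:
            dep[d][d] += 1
        for a, b in combinations(sorted(used), 2):
            dep[a][b] += 1
            dep[b][a] += 1
    return dep


def place_new_dataset(new_ds, dep, placement, centres):
    """Runtime step (assumed rule): choose the centre whose resident
    datasets have the largest summed dependency with new_ds.
    Ties are resolved in favour of the centre listed first."""
    score = {c: 0 for c in centres}
    for ds, centre in placement.items():
        score[centre] += dep[new_ds].get(ds, 0)
    return max(centres, key=lambda c: score[c])


# Hypothetical workflow: each task maps to the set of datasets it uses.
tasks = {
    "t1": {"d1", "d2"},
    "t2": {"d1", "d2", "d3"},
    "t3": {"d3", "d4"},
    "t4": {"d4", "d5"},
}
datasets = ["d1", "d2", "d3", "d4", "d5"]
dep = dependency_matrix(tasks, datasets)

# Assumed build-time result: d1..d4 already grouped into two centres.
placement = {"d1": "dc1", "d2": "dc1", "d3": "dc2", "d4": "dc2"}
chosen = place_new_dataset("d5", dep, placement, ["dc1", "dc2"])
```

In this toy workflow, d5 is used only together with d4, so the sketch selects the centre that holds d4. A build-time step would partition the existing datasets into k groups using the same matrix; that partitioning is omitted here because the abstract does not specify how it is done.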