Kepler + Hadoop: a general architecture facilitating data-intensive applications in scientific workflow systems

Authors:
Jianwu Wang; Daniel Crawl; Ilkay Altintas
Affiliations:
University of California, San Diego, La Jolla, CA (all authors)
Venue:
Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science
Year:
2009

Abstract

MapReduce provides a parallel and scalable programming model for data-intensive business and scientific applications. MapReduce and its de facto standard open source implementation, Hadoop, support parallel processing of large datasets, with capabilities including automatic data partitioning and distribution, load balancing, and fault tolerance management. Meanwhile, scientific workflow management systems, e.g., Kepler, Taverna, Triana, and Pegasus, have demonstrated their ability to help domain scientists solve scientific problems by synthesizing different data and computing resources. By integrating Hadoop with Kepler, we provide an easy-to-use architecture that helps users compose and execute MapReduce applications in Kepler scientific workflows. Our implementation demonstrates that many characteristics of scientific workflow management systems, e.g., a graphical user interface and component reuse and sharing, complement those of MapReduce. Using the presented Hadoop components in Kepler, scientists can easily apply MapReduce to their domain-specific problems and connect those components with other tasks in a workflow through the Kepler graphical user interface. We validate the feasibility of our approach via a word count use case.
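For readers unfamiliar with the word count use case, the following is a minimal in-memory sketch of the map, shuffle, and reduce steps it involves. It is an illustration only: it uses neither the Hadoop API nor the Kepler components described in the paper, and the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for each whitespace-separated word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce sequentially over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())
```

In a real Hadoop job the map and reduce functions would run in parallel across data partitions and the framework would perform the grouping. Here everything runs sequentially in one process, which is enough to show the data flow.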