Case study of scientific data processing on a cloud using Hadoop

Authors:
Chen Zhang;Hans De Sterck;Ashraf Aboulnaga;Haig Djambazian;Rob Sladek
Affiliations:
David R. Cheriton School of Computer Science, University of Waterloo, Ontario, Canada;Department of Applied Mathematics, University of Waterloo, Ontario, Canada;David R. Cheriton School of Computer Science, University of Waterloo, Ontario, Canada;McGill University and Genome Quebec Innovation Centre, Montreal, Quebec, Canada;McGill University and Genome Quebec Innovation Centre, Montreal, Quebec, Canada
Venue:
HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications
Year:
2009

Abstract

With the increasing popularity of cloud computing, Hadoop has become a widely used open-source cloud computing framework for large-scale data processing. However, few efforts have been made to demonstrate the applicability of Hadoop to real-world application scenarios outside server-side computations such as web indexing. In this paper, we use the Hadoop cloud computing framework to develop a user application that allows processing of scientific data on clouds. We describe a simple extension to Hadoop’s MapReduce that allows it to handle scientific data processing problems with arbitrary input formats and explicit control over how the input is split. We use this approach to develop a Hadoop-based cloud computing application that processes sequences of microscope images of live cells, and we test its performance. We also discuss how the approach can be generalized to more complicated scientific data processing problems.
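
The idea of explicit control over how the input is split can be illustrated with a small, framework-free sketch. It is not the authors' extension and does not use Hadoop's API; it only assumes that an image sequence is an ordered list of frame names and that each split should hold a fixed number of consecutive frames. The names `make_splits`, `run_map_phase`, `count_frames`, and the `cell_000.tif` style file names are hypothetical and chosen for illustration.

```python
from typing import Callable, Dict, List, Sequence


def make_splits(frames: Sequence[str], frames_per_split: int) -> List[List[str]]:
    """Group an ordered frame list into contiguous splits of a chosen size.

    The caller, not the framework, decides where split boundaries fall.
    """
    if frames_per_split < 1:
        raise ValueError("frames_per_split must be at least 1")
    return [
        list(frames[start:start + frames_per_split])
        for start in range(0, len(frames), frames_per_split)
    ]


def run_map_phase(
    splits: Sequence[Sequence[str]],
    mapper: Callable[[Sequence[str]], int],
) -> Dict[int, int]:
    """Apply the mapper to each split in turn, keyed by split index.

    A real cluster would run these calls in parallel on different nodes;
    here they run sequentially so the sketch stays self-contained.
    """
    return {index: mapper(split) for index, split in enumerate(splits)}


def count_frames(split: Sequence[str]) -> int:
    """Placeholder mapper: stands in for per-split image processing."""
    return len(split)


# Hypothetical sequence of seven microscope frames.
frames = [f"cell_{i:03d}.tif" for i in range(7)]
splits = make_splits(frames, frames_per_split=3)
results = run_map_phase(splits, count_frames)
```

With seven frames and three frames per split, the last split holds the single remaining frame. A real application would replace `count_frames` with the actual image-processing step and would choose the split size to suit the data.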