Data-intensive science: The Terapixel and MODISAzure projects

Authors:
Deb Agarwal;You-Wei Cheah;Dan Fay;Jonathan Fay;Dean Guo;Tony Hey;Marty Humphrey;Keith Jackson;Jie Li;Christophe Poulain;Youngryel Ryu;Catharine Van Ingen
Affiliations:
Lawrence Berkeley National Lab, USA;School of Informatics and Computing, Indiana University, USA;Microsoft Research, USA;Microsoft Research, USA;Microsoft Research, USA;Microsoft Research, USA;Department of Computer Science, University of Virginia, USA;Lawrence Berkeley National Lab, USA;Department of Computer Science, University of Virginia, USA;Microsoft Research, USA;Department of Organismic and Evolutionary Biology, Harvard University, USA;Microsoft Research, USA
Venue:
International Journal of High Performance Computing Applications
Year:
2011

Abstract

We live in an era in which scientific discovery is increasingly driven by data exploration of massive datasets. Scientists today are envisioning diverse data analyses and computations that scale from the desktop to supercomputers, yet often have difficulty designing and constructing software architectures to accommodate the heterogeneous and often inconsistent data at scale. Moreover, scientific data and computational resource needs can vary widely over time. The needs grow as the science collaboration broadens or as additional data is accumulated; the computational demand can have large transients in response to seasonal field campaigns or new instrumentation breakthroughs. Cloud computing can offer a scalable, economic, on-demand model that is well matched to some of these evolving science needs. This paper presents two of our experiences over the last year: the Terapixel Project, using workflow, high-performance computing and non-structured query language data processing to render the largest astronomical image for the WorldWide Telescope, and MODISAzure, a science pipeline for image processing, deployed using the Azure Cloud infrastructure.