Processing large volumes of scientific data requires an efficient and scalable parallel computing framework to obtain meaningful information quickly. In this paper, we evaluate how well a scientific application from the environmental sciences suits the MapReduce framework. We consider cccgistemp, a Python reimplementation of the original NASA GISS model for estimating global temperature change, which takes land and ocean temperature records from different sites, removes duplicate records, and adjusts for urbanisation effects before calculating the 12-month running mean of global temperature. The application consists of several stages, each with different characteristics; three of these stages have been ported to Hadoop using the mrjob library. We note the performance bottlenecks encountered while porting and suggest possible solutions, including modifying data access patterns to overcome the uneven distribution of input data.
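To make the shape of such a port concrete, the following is a minimal sketch of how duplicate removal and a 12-month running mean might be phrased as map and reduce steps. It is not the cccgistemp algorithm and does not use mrjob or Hadoop: the record layout `(station, year, month, temp)`, the function names, and the single-process `run_mapreduce` driver are all assumptions made for illustration, and the urbanisation adjustment is omitted entirely.

```python
from collections import defaultdict


def map_record(record):
    """Map step: key each reading by its (year, month)."""
    station, year, month, temp = record  # hypothetical record layout
    yield (year, month), (station, temp)


def reduce_month(key, values):
    """Reduce step: keep one reading per station, then average them."""
    unique = dict(values)  # a repeated station overwrites its earlier entry
    return key, sum(unique.values()) / len(unique)


def run_mapreduce(records):
    """Simulate map, shuffle, and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reduce_month(k, v) for k, v in groups.items())


def running_mean(monthly, window=12):
    """Mean of each run of `window` months, labelled by the run's last month.

    Assumes `monthly` has no gaps, so consecutive sorted keys are
    consecutive calendar months.
    """
    keys = sorted(monthly)
    values = [monthly[k] for k in keys]
    return [
        (keys[i + window - 1], sum(values[i:i + window]) / window)
        for i in range(len(values) - window + 1)
    ]
```

Calling `running_mean(run_mapreduce(records))` on a list of such records returns one `((year, month), mean)` pair per complete 12-month window. Keying by month places every station's reading for a given month on the same reducer, so months with many more readings than others would load reducers unevenly, which is the kind of skew the abstract refers to.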