Parallel accessing massive NetCDF data based on mapreduce

Authors:
Hui Zhao;SiYun Ai;ZhenHua Lv;Bo Li
Affiliations:
Key Laboratory of Trustworthy Computing of Shanghai, China and Institute of Software Engineering, East China Normal University Shanghai, China;School of EEE Communication Software & Network, Nanyang Technology University Singapore;Key Laboratory of Geographic Information Science, Ministry of Education, Geography Department, East China Normal University, Shanghai, China;Key Laboratory of Trustworthy Computing of Shanghai, China
Venue:
WISM'10 Proceedings of the 2010 international conference on Web information systems and mining
Year:
2010

Abstract

NetCDF (Network Common Data Form) is widely used in the terrestrial, marine, and atmospheric sciences. A new parallel storage and access method for large-scale NetCDF scientific data is implemented on Hadoop, with the retrieval method implemented in MapReduce. Argo data is used to demonstrate the method. Performance is compared in a distributed environment built from PCs, using different data scales and different numbers of tasks. The experimental results show that the parallel method can store and access large-scale NetCDF data efficiently.