Towards efficient data search and subsetting of large-scale atmospheric datasets

Authors:
Sangmi Lee Pallickara; Shrideep Pallickara; Milija Zupanski
Affiliations:
Department of Computer Science, Colorado State University, United States; Department of Computer Science, Colorado State University, United States; Cooperative Institute for Research in the Atmosphere, Colorado State University, United States
Venue:
Future Generation Computer Systems
Year:
2012

Abstract

Discovering the correct dataset efficiently is critical for effective simulations in the atmospheric sciences. Unlike text-based web documents, many large scientific datasets contain binary-encoded data that is hard to discover using popular search engines. In the atmospheric sciences, public data hosting services have grown significantly. However, the ability to index and search these datasets has been limited by the metadata provided by the data host. We have developed an infrastructure, the Atmospheric Data Discovery System (ADDS), that provides an efficient data discovery environment for observational datasets in the atmospheric sciences. To support complex queries, we automatically extract and index fine-grained metadata. Datasets are indexed through periodic crawling of popular sites and of files requested by users. Our data customization feature lets users access subsets of a large dataset. This paper focuses on the overall architecture, the data subsetting scheme, and a performance evaluation of our system.
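
The two ideas named in the abstract, indexing fine-grained metadata and subsetting a large dataset, can be illustrated with a minimal Python sketch. It is not the ADDS implementation: the record fields, the class and function names (DatasetMeta, MetadataIndex, subset), and the example.org URLs are all hypothetical, and a plain in-memory dictionary stands in for whatever index the system actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetMeta:
    """Hypothetical fine-grained metadata record for one data file."""
    url: str
    variable: str
    lat_range: tuple  # (min_lat, max_lat) in degrees
    lon_range: tuple  # (min_lon, max_lon) in degrees


def overlaps(a, b):
    """Return True when the closed intervals a and b intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


class MetadataIndex:
    """Toy in-memory index keyed by variable name (an assumption, not ADDS)."""

    def __init__(self):
        self._by_variable = {}

    def add(self, meta):
        self._by_variable.setdefault(meta.variable, []).append(meta)

    def search(self, variable, lat_range, lon_range):
        """Find records for a variable whose bounding box overlaps the query."""
        return [
            m for m in self._by_variable.get(variable, [])
            if overlaps(m.lat_range, lat_range) and overlaps(m.lon_range, lon_range)
        ]


def subset(lats, lons, grid, lat_range, lon_range):
    """Keep only the grid cells whose coordinates fall inside the requested box."""
    rows = [i for i, lat in enumerate(lats) if lat_range[0] <= lat <= lat_range[1]]
    cols = [j for j, lon in enumerate(lons) if lon_range[0] <= lon <= lon_range[1]]
    return [[grid[i][j] for j in cols] for i in rows]


# Usage with made-up records and a made-up 3x3 grid.
index = MetadataIndex()
index.add(DatasetMeta("http://example.org/a.nc", "temperature", (0, 20), (100, 120)))
index.add(DatasetMeta("http://example.org/b.nc", "temperature", (40, 60), (100, 120)))
hits = index.search("temperature", (5, 15), (105, 110))

lats, lons = [0, 10, 20], [100, 110, 120]
grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
region = subset(lats, lons, grid, (5, 20), (100, 110))
```

Here the query matches only the first record, and the subset keeps the four cells at latitudes 10 and 20 and longitudes 100 and 110. Indexing by variable first and filtering by bounding box second is only one possible design; it keeps the sketch short and says nothing about how the real system orders its lookups.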