Towards efficient data search and subsetting of large-scale atmospheric datasets

  • Authors:
  • Sangmi Lee Pallickara;Shrideep Pallickara;Milija Zupanski

  • Affiliations:
  • Department of Computer Science, Colorado State University, United States;Department of Computer Science, Colorado State University, United States;Cooperative Institute for Research in the Atmosphere, Colorado State University, United States

  • Venue:
  • Future Generation Computer Systems
  • Year:
  • 2012

Abstract

Discovering the right dataset efficiently is critical for effective simulations in the atmospheric sciences. Unlike text-based web documents, large scientific datasets often contain binary-encoded data that is hard to discover using popular search engines. Public data hosting services in the atmospheric sciences have grown significantly. However, the ability to index and search these datasets has been limited by the metadata that the data host provides. We have developed an infrastructure, the Atmospheric Data Discovery System (ADDS), that provides an efficient data discovery environment for observational datasets in the atmospheric sciences. To support complex queries, we automatically extract and index fine-grained metadata. Indexing draws on periodic crawls of popular sites and on files requested by users. Users can access subsets of a large dataset through our data customization feature. This paper focuses on the overall architecture, the data subsetting scheme, and a performance evaluation of the system.
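
To make the two ideas in the abstract concrete, the following is a minimal sketch of searching an index of fine-grained metadata and then subsetting a dataset by region. It is not the ADDS implementation: the `DatasetMetadata` record, its fields, and the `search` and `subset` functions are hypothetical names chosen for illustration, and the sketch assumes simple latitude/longitude bounding boxes that do not wrap around the date line.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical fine-grained metadata record for one dataset file."""
    name: str
    variables: frozenset  # observed variables, e.g. {"temperature"}
    start: date           # date of the first observation
    end: date             # date of the last observation
    bbox: tuple           # (min_lat, min_lon, max_lat, max_lon)


def overlaps(a, b):
    """Return True if two (min_lat, min_lon, max_lat, max_lon) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(index, variable, start, end, bbox):
    """Return names of datasets that hold `variable` and overlap the
    requested time range and region."""
    return [
        m.name
        for m in index
        if variable in m.variables
        and m.start <= end
        and start <= m.end
        and overlaps(m.bbox, bbox)
    ]


def subset(points, bbox):
    """Keep only (lat, lon, value) observations that fall inside `bbox`."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return [
        p for p in points
        if min_lat <= p[0] <= max_lat and min_lon <= p[1] <= max_lon
    ]
```

A query names a variable, a time window, and a region; only datasets whose extracted metadata matches all three are returned, and `subset` then trims a matching dataset to the requested region so that the user does not need to download the whole file.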