Efficient virtualization of scientific data

  • Authors:
  • Joel Saltz; Sivaramakrishnan Narayanan

  • Affiliations:
  • The Ohio State University; The Ohio State University

  • Venue:
  • Efficient virtualization of scientific data
  • Year:
  • 2008

Abstract

The availability of commodity components has made it affordable to build Beowulf clusters with multi-level memory hierarchies, large storage space, and computing resources. These systems have helped fuel data-intensive scientific applications such as bioinformatics, imaging, and oil reservoir and seismic simulation studies, which require the generation and analysis of large multi-dimensional datasets. A highly abstracted view of this data would benefit scientists by hiding the complexity of data layout and integration and by simplifying the development of analysis algorithms. Certain analysis processes require access to attribute values at individual points in the multi-dimensional space. A table-like view of the data would allow these processes to express their region of interest using queries on attributes. The scale of scientific datasets poses I/O and computational problems in providing this structural access.
Manual and automatic techniques such as segmentation may be employed to annotate interesting regions that represent important concepts in the domain in question. These annotations allow more sophisticated semantic access to scientific datasets, using terms that a scientist is familiar with. The meaning of these terms is captured in ontologies and rules with well-defined semantics. Semantic access allows the scientist to query the dataset not only on explicit annotations but also on implicit ones. Computing implicit annotations from explicit annotations and domain knowledge is called materialization. Efficient materialization and efficient querying of implicit information are necessary to support semantic access over large scientific datasets.
In my research, I have investigated structural and semantic access to scientific datasets. Providing either kind of access over large scientific datasets poses both I/O and computational problems.
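The idea of materialization described above can be illustrated with a minimal sketch. It assumes the simplest possible rule set, a subclass hierarchy over concepts, and applies it repeatedly until no new annotations appear. The function name `materialize`, the region identifiers, and the concept names are all hypothetical and chosen only for illustration; the ontologies and rules used in the actual work are not specified here and may be considerably richer.

```python
def materialize(explicit, subclass_of):
    """Return the explicit annotations plus every annotation they imply.

    explicit:    set of (region, concept) pairs asserted directly.
    subclass_of: set of (child, parent) concept pairs (hypothetical rules).
    """
    facts = set(explicit)
    changed = True
    # Forward chaining: keep applying the subclass rule until a fixpoint.
    while changed:
        changed = False
        for region, concept in list(facts):
            for child, parent in subclass_of:
                if concept == child and (region, parent) not in facts:
                    facts.add((region, parent))
                    changed = True
    return facts


# Hypothetical explicit annotations, e.g. produced by segmentation.
explicit = {("region_17", "SaltDome"), ("region_42", "Fault")}

# Hypothetical domain knowledge: a small concept hierarchy.
subclass_of = {
    ("SaltDome", "GeologicalStructure"),
    ("Fault", "GeologicalStructure"),
    ("GeologicalStructure", "Feature"),
}

all_facts = materialize(explicit, subclass_of)
implicit = all_facts - explicit  # annotations that were never stated directly
```

Once `all_facts` has been computed, a query such as "all regions annotated as `Feature`" can be answered by a plain lookup, which is the benefit of materializing ahead of time rather than reasoning at query time.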
In the context of structural access, I have developed a middleware framework that can exploit a shared-nothing cluster configuration to optimize performance. The framework uses additional disk space to replicate parts of the dataset and chooses the appropriate join algorithm based on dataset and system parameters. In the context of semantic access, I have investigated the problem of materializing large ontologies using databases and querying them using spatial predicates. I have also investigated optimizing the materialization of ontologies in a shared-nothing cluster environment. The results indicate that coarse-grained parallelism may be exploited in many scenarios to provide efficient structural and semantic access to scientific datasets.
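As a rough illustration of what choosing a join algorithm from dataset and system parameters can look like, the sketch below applies a generic textbook rule of thumb for shared-nothing clusters. It is not the framework's actual policy, which is not described here. The class `JoinInputs`, the function `choose_join`, the parameter names, and the thresholds are all assumptions, and the sketch ignores replication entirely.

```python
from dataclasses import dataclass


@dataclass
class JoinInputs:
    """Hypothetical dataset and system parameters for one join."""
    left_bytes: int             # size of the left relation
    right_bytes: int            # size of the right relation
    nodes: int                  # number of cluster nodes
    memory_per_node_bytes: int  # memory available for the join on each node


def choose_join(p: JoinInputs) -> str:
    """Pick a join strategy with a simple size-versus-memory heuristic."""
    smaller = min(p.left_bytes, p.right_bytes)
    # The smaller relation fits in one node's memory: copy it to every node.
    if smaller <= p.memory_per_node_bytes:
        return "broadcast_hash_join"
    # Each node's share of the smaller relation fits in memory after
    # hash-partitioning both relations across the cluster.
    if smaller / p.nodes <= p.memory_per_node_bytes:
        return "partitioned_hash_join"
    # Otherwise partitions must also be spilled to local disk.
    return "grace_hash_join"


GB = 10**9
small_side = choose_join(JoinInputs(1000 * GB, GB // 10, 16, GB))
medium_side = choose_join(JoinInputs(1000 * GB, 8 * GB, 16, GB))
large_side = choose_join(JoinInputs(10000 * GB, 1000 * GB, 16, GB))
```

A real decision procedure would likely also weigh network bandwidth, disk throughput, and which parts of the dataset are already replicated on which nodes.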