Efficient virtualization of scientific data

Authors:
Joel Saltz;Sivaramakrishnan Narayanan
Affiliations:
The Ohio State University;The Ohio State University
Venue:
Efficient virtualization of scientific data
Year:
2008

Citing 0
Cited 1

High-performance systems for in silico microscopy imaging studies

DILS'10 Proceedings of the 7th international conference on Data integration in the life sciences

Abstract

The availability of commodity components has made it affordable to build Beowulf clusters with multi-level memory hierarchies, large storage capacity, and substantial computing resources. These systems have helped fuel data-intensive scientific applications such as bioinformatics, imaging, and oil reservoir and seismic simulation studies, which require the generation and analysis of large multi-dimensional datasets. A highly abstracted view of this data would benefit scientists by hiding the complexity of data layout and integration and by simplifying the development of analysis algorithms. Certain analysis processes require access to attribute values at individual points in the multi-dimensional space. A table-like view of the data would allow these processes to express their region of interest using queries on attributes. The scale of scientific datasets poses I/O and computational problems in providing such structural access.
Manual and automatic techniques such as segmentation may be employed to annotate interesting regions that represent important concepts in the domain in question. These annotations allow more sophisticated semantic access to scientific datasets, using terms that a scientist is familiar with. The meaning of these terms is captured in ontologies and rules with well-defined semantics. Semantic access allows the scientist to query the dataset using not only explicit annotations but also implicit annotations. Computing implicit annotations from explicit annotations and domain knowledge is called materialization. Efficient materialization and efficient querying of implicit information are necessary to support semantic access over large scientific datasets.
In my research, I have investigated structural and semantic access to scientific datasets. Providing either kind of access over large scientific datasets poses both I/O and computational problems.
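As a rough illustration of the table-like view described above, the following Python sketch exposes a small two-dimensional grid as a stream of rows, one per grid point, and selects a region of interest with a predicate on attributes. The names `virtual_table`, `select`, and `intensity` are hypothetical and chosen only for this example; the sketch does not reflect the middleware developed in the dissertation.

```python
from itertools import product

def virtual_table(grid, attribute="intensity"):
    """Lazily yield one row per grid point: its coordinates plus the attribute value."""
    n_rows, n_cols = len(grid), len(grid[0])
    for x, y in product(range(n_rows), range(n_cols)):
        yield {"x": x, "y": y, attribute: grid[x][y]}

def select(rows, predicate):
    """Return the rows that satisfy a predicate over their attributes."""
    return [row for row in rows if predicate(row)]

# Hypothetical 2 x 3 dataset; the values are made up for illustration.
grid = [
    [0.1, 0.9, 0.3],
    [0.8, 0.2, 0.7],
]

# Region of interest expressed as a query on attributes.
hits = select(virtual_table(grid), lambda r: r["intensity"] > 0.5)
print([(r["x"], r["y"]) for r in hits])
```

Because the rows are generated on demand, the caller never sees how the underlying data is laid out, which is the point of the abstraction; a real system would additionally have to address the I/O cost of scanning large datasets.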
In the context of structural access, I have developed a middleware framework that can exploit a shared-nothing cluster configuration to optimize performance. The framework uses additional disk space to replicate parts of the dataset to improve performance and chooses the appropriate join algorithm based on dataset and system parameters. In the context of semantic access, I have investigated the problem of materializing large ontologies using databases and querying them using spatial predicates. I have also investigated optimizing the materialization of ontologies in a shared-nothing cluster environment. Our results indicate that coarse-grained parallelism may be exploited in many scenarios to provide efficient structural and semantic access to scientific datasets.
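To make the notion of materialization concrete, here is a minimal Python sketch that derives implicit annotations from explicit ones using a single subclass rule applied to a fixpoint. The concept names, region identifiers, and the `materialize` function are hypothetical and serve only as an illustration; the dissertation's database-backed and parallel techniques are not shown.

```python
def materialize(explicit, subclass_of):
    """Return the explicit annotations plus all implicit ones.

    Applies one rule until nothing new is derived: if a region is annotated
    with concept C and C is a subclass of D, the region is also annotated with D.
    """
    facts = set(explicit)
    frontier = list(explicit)
    while frontier:
        region, concept = frontier.pop()
        for parent in subclass_of.get(concept, ()):
            fact = (region, parent)
            if fact not in facts:
                facts.add(fact)
                frontier.append(fact)
    return facts

# Hypothetical domain knowledge: a tiny subclass hierarchy.
subclass_of = {"Neuroblast": ["TumorCell"], "TumorCell": ["Cell"]}

# Hypothetical explicit annotations, e.g. produced by segmentation.
explicit = {("region1", "Neuroblast"), ("region2", "Cell")}

facts = materialize(explicit, subclass_of)
implicit = facts - explicit

# A query on a concept that was never annotated explicitly.
tumor_regions = sorted(region for region, concept in facts if concept == "TumorCell")
print(sorted(implicit), tumor_regions)
```

Materializing once and querying the stored facts trades storage and up-front computation for cheaper queries, which is why the efficiency of both steps matters at the scale of large scientific datasets.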