An architecture for a data-intensive computer

  • Authors:
  • Edward Givelberg;Alexander Szalay;Kalin Kanov;Randal Burns

  • Affiliations:
  • The Johns Hopkins University, Baltimore, MD, USA;The Johns Hopkins University, Baltimore, MD, USA;The Johns Hopkins University, Baltimore, MD, USA;The Johns Hopkins University, Baltimore, MD, USA

  • Venue:
  • Proceedings of the first international workshop on Network-aware data management
  • Year:
  • 2011

Abstract

Scientific instruments, as well as simulations, generate increasingly large datasets, changing the way we do science. We propose a system, which we call the data-intensive computer, for computing with petascale datasets. The data-intensive computer consists of an HPC cluster, a massively parallel database, and a set of computing servers running the data-intensive operating system, which turns the database into a layer in the memory hierarchy of the data-intensive computer. The data-intensive operating system is data-object-oriented: the abstract programming model of a sequential file, central to traditional computer operating systems, is replaced with system-level support for high-level data objects, such as multi-dimensional arrays, graphs, and sparse arrays. User application programs will be compiled into code that is executed both on the HPC cluster and inside the database. The data-intensive operating system is, however, non-local, allowing remote applications to execute code inside the database. This model supports a collaborative environment, in which a large dataset is typically created and processed by a large group of users. We are developing a software library, MPI-DB, which is a prototype of the data-intensive operating system. It is currently being used by the Turbulence group at JHU to store simulation output in the database and to perform simulations refining previously stored results.
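
The data-object-oriented idea in the abstract, in which applications address array objects held in a database rather than sequential files, can be illustrated with a small sketch. This is not the MPI-DB API, which the abstract does not describe. The class ArrayStore, its methods put and get, the table layout, and the array name "velocity" are all hypothetical, and an in-memory SQLite database stands in for the massively parallel database purely for illustration.

```python
import sqlite3
import struct


class ArrayStore:
    """Hypothetical database-backed array layer (illustrative, not MPI-DB)."""

    def __init__(self, block=2):
        self.block = block  # edge length of a square block
        # Assumption: SQLite stands in for the parallel database.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE blocks (name TEXT, bi INTEGER, bj INTEGER, "
            "data BLOB, PRIMARY KEY (name, bi, bj))"
        )

    def put(self, name, rows):
        """Split a 2-D list of floats into blocks; store one row per block."""
        b = self.block
        n, m = len(rows), len(rows[0])
        for bi in range(0, n, b):
            for bj in range(0, m, b):
                tile = [rows[i][bj:bj + b] for i in range(bi, min(bi + b, n))]
                flat = [x for r in tile for x in r]
                # Header holds the tile shape, since edge tiles may be smaller.
                header = struct.pack("<2i", len(tile), len(tile[0]))
                payload = struct.pack(f"<{len(flat)}d", *flat)
                self.db.execute(
                    "INSERT OR REPLACE INTO blocks VALUES (?, ?, ?, ?)",
                    (name, bi // b, bj // b, header + payload),
                )

    def get(self, name, i, j):
        """Read one element by fetching only the block that contains it."""
        b = self.block
        row = self.db.execute(
            "SELECT data FROM blocks WHERE name=? AND bi=? AND bj=?",
            (name, i // b, j // b),
        ).fetchone()
        if row is None:
            raise IndexError((i, j))
        h, w = struct.unpack_from("<2i", row[0], 0)
        li, lj = i % b, j % b
        if li >= h or lj >= w:
            raise IndexError((i, j))
        # Skip the 8-byte header, then index into row-major doubles.
        return struct.unpack_from("<d", row[0], 8 + 8 * (li * w + lj))[0]


store = ArrayStore(block=2)
store.put("velocity", [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
value = store.get("velocity", 2, 1)
```

In this sketch the application refers to an element by array name and index, and the store decides which block to fetch. A real system along the lines of the abstract would distribute blocks across database servers and run computation next to them, which this single-process example does not attempt.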