An architecture for a data-intensive computer

Authors:
Edward Givelberg;Alexander Szalay;Kalin Kanov;Randal Burns
Affiliations:
The Johns Hopkins University, Baltimore, MD, USA;The Johns Hopkins University, Baltimore, MD, USA;The Johns Hopkins University, Baltimore, MD, USA;The Johns Hopkins University, Baltimore, MD, USA
Venue:
Proceedings of the first international workshop on Network-aware data management
Year:
2011

Abstract

Scientific instruments and simulations generate increasingly large datasets, changing the way we do science. We propose a system, which we call the data-intensive computer, for computing with petascale datasets. The data-intensive computer consists of an HPC cluster, a massively parallel database, and a set of computing servers running the data-intensive operating system, which turns the database into a layer in the memory hierarchy of the data-intensive computer. The data-intensive operating system is data-object-oriented: the abstract programming model of a sequential file, central to traditional operating systems, is replaced with system-level support for high-level data objects such as multi-dimensional arrays, graphs, and sparse arrays. User application programs will be compiled into code that is executed both on the HPC cluster and inside the database. The data-intensive operating system is, however, non-local: it allows remote applications to execute code inside the database. This model supports a collaborative environment in which a large dataset is typically created and processed by a large group of users. We are developing a software library, MPI-DB, which is a prototype of the data-intensive operating system. It is currently being used by the Turbulence group at JHU to store simulation output in the database and to perform simulations that refine previously stored results.
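
The data-object-oriented idea in the abstract can be illustrated with a minimal sketch. It is not the MPI-DB API, which the abstract does not describe: the class `ArrayStore`, its methods `put` and `get_block`, the tile size, and the use of an in-memory SQLite database as a stand-in for the massively parallel database are all assumptions made for illustration. The sketch shows an application storing a two-dimensional array as an object and later reading back a sub-block, with the database fetching only the tiles that overlap the request, rather than the application scanning a sequential file.

```python
import json
import sqlite3


class ArrayStore:
    """Hypothetical toy store: keeps 2-D arrays as tiles in a database table."""

    def __init__(self, tile=2):
        self.tile = tile
        # In-memory SQLite stands in for the parallel database (assumption).
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE tiles (name TEXT, ti INTEGER, tj INTEGER, data TEXT, "
            "PRIMARY KEY (name, ti, tj))"
        )

    def put(self, name, rows):
        """Split a list-of-lists array into tiles and store one row per tile."""
        t = self.tile
        n_rows, n_cols = len(rows), len(rows[0])
        for r in range(0, n_rows, t):
            for c in range(0, n_cols, t):
                block = [row[c:c + t] for row in rows[r:r + t]]
                self.db.execute(
                    "INSERT OR REPLACE INTO tiles VALUES (?, ?, ?, ?)",
                    (name, r // t, c // t, json.dumps(block)),
                )
        self.db.commit()

    def get_block(self, name, r0, r1, c0, c1):
        """Return rows r0..r1-1 and columns c0..c1-1, reading only overlapping tiles."""
        t = self.tile
        out = [[None] * (c1 - c0) for _ in range(r1 - r0)]
        cursor = self.db.execute(
            "SELECT ti, tj, data FROM tiles WHERE name = ? "
            "AND ti BETWEEN ? AND ? AND tj BETWEEN ? AND ?",
            (name, r0 // t, (r1 - 1) // t, c0 // t, (c1 - 1) // t),
        )
        for ti, tj, data in cursor:
            block = json.loads(data)
            for i, row in enumerate(block):
                for j, value in enumerate(row):
                    r, c = ti * t + i, tj * t + j
                    # Copy only the cells that fall inside the requested window.
                    if r0 <= r < r1 and c0 <= c < c1:
                        out[r - r0][c - c0] = value
        return out


# Usage: store a 4x5 array, then read back a 2x3 sub-block.
store = ArrayStore(tile=2)
field = [[10 * r + c for c in range(5)] for r in range(4)]
store.put("velocity", field)
sub = store.get_block("velocity", 1, 3, 2, 5)
print(sub)
```

In this sketch the application never sees a file offset; it names an array and an index range, and the storage layer decides which tiles to touch. A real system would distribute the tiles across database nodes and run computation next to them, which this single-process example does not attempt.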