T2: a customizable parallel database for multi-dimensional data

  • Authors:
  • Chialin Chang;Anurag Acharya;Alan Sussman;Joel Saltz

  • Affiliations:
  • Dept. of Computer Science, University of Maryland, College Park, MD;Dept. of Computer Science, University of California, Santa Barbara, CA;Dept. of Computer Science, University of Maryland, College Park, MD;Dept. of Computer Science, University of Maryland, College Park, MD and Dept. of Pathology, Johns Hopkins Medical Institutions, Baltimore, MD

  • Venue:
  • ACM SIGMOD Record
  • Year:
  • 1998

Abstract

As computational power and storage capacity increase, processing and analyzing large volumes of data play an increasingly important part in many domains of scientific research. Typical examples of large scientific datasets include long running simulations of time-dependent phenomena that periodically generate snapshots of their state (e.g. hydrodynamics and chemical transport simulation for estimating pollution impact on water bodies [4, 6, 20], magnetohydrodynamics simulation of planetary magnetospheres [32], simulation of a flame sweeping through a volume [28], airplane wake simulations [21]), archives of raw and processed remote sensing data (e.g. AVHRR [25], Thematic Mapper [17], MODIS [22]), and archives of medical images (e.g. confocal light microscopy, CT imaging, MRI, sonography).

These datasets are usually multi-dimensional. The data dimensions can be spatial coordinates, time, or experimental conditions such as temperature, velocity or magnetic field. The importance of such datasets has been recognized by several database research groups and vendors, and several systems have been developed for managing and/or visualizing them [2, 7, 14, 19, 26, 27, 29, 31].

These systems, however, focus on lineage management, retrieval and visualization of multi-dimensional datasets. They provide little or no support for analyzing or processing these datasets -- the assumption is that this is too application-specific to warrant common support.
As a result, applications that process these datasets are usually decoupled from data storage and management, resulting in inefficiency due to copying and loss of locality. Furthermore, every application developer has to implement complex support for managing and scheduling the processing.

Over the past three years, we have been working with several scientific research groups to understand the processing requirements for such applications [1, 5, 6, 10, 18, 23, 24, 28]. Our study of a large set of applications indicates that the processing for such datasets is often highly stylized and shares several important characteristics. Usually, both the input dataset and the result being computed have underlying multi-dimensional grids, and queries into the dataset are in the form of ranges within each dimension of the grid. The basic processing step usually consists of transforming individual input items, mapping the transformed items to the output grid and computing output items by aggregating, in some way, all the transformed input items mapped to the corresponding grid point. For example, remote-sensing earth images are often generated by performing atmospheric correction on several days' worth of raw telemetry data, mapping all the data to a latitude-longitude grid and selecting those measurements that provide the clearest view.

In this paper, we present T2, a customizable parallel database that integrates storage, retrieval and processing of multi-dimensional datasets. T2 provides support for many operations including index generation, data retrieval, memory management, scheduling of processing across a parallel machine and user interaction. It achieves its primary advantage from the ability to seamlessly integrate data retrieval and processing for a wide variety of applications and from the ability to maintain and process multiple datasets with different underlying grids.
Most other systems for multi-dimensional data have focused on uniformly distributed datasets, such as images, maps, and dense multi-dimensional arrays. Many real datasets, however, are non-uniform or unstructured. For example, satellite data is a two dimensional strip that is embedded in a three dimensional space; water contamination studies use unstructured meshes to selectively simulate regions and so on. T2 can handle both uniform and non-uniform datasets.

T2 has been developed as a set of modular services. Since its structure mirrors that of a wide variety of applications, T2 is easy to customize for different types of processing. To build a version of T2 customized for a particular application, a user has to provide functions to pre-process the input data, map input data to elements in the output data, and aggregate multiple input data items that map to the same output element.

T2 presents a uniform interface to the end users (the clients of the database system). Users specify the dataset(s) of interest, a region of interest within the dataset(s), and the desired format and resolution of the output. In addition, they select the mapping and aggregation functions to be used. T2 analyzes the user request, builds a suitable plan to retrieve and process the datasets, executes the plan and presents the results in the desired format.

In Section 2 we first present several motivating applications and illustrate their common structure. Section 3 then presents an overview of T2, including its distinguishing features and a running example. Section 4 describes each database service in some detail. An example of how to customize several of the database services for a particular application is given in Section 5. T2 is a system in evolution. We conclude in Section 6 with a description of the current status of both the T2 design and the implementation of various applications with T2.
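
The processing pattern described in the abstract (select input items by a range in each dimension, transform each item, map it to an output grid element, and aggregate all items that land on the same element) can be illustrated with a short serial sketch. This is not T2's interface: the names `Item`, `in_range`, `process`, `transform`, `map_to_cell` and `aggregate` are hypothetical, the data is made up, and the parallel scheduling, indexing and memory management that T2 provides are left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Sequence, Tuple

Cell = Tuple[int, ...]  # index of an element in the output grid


@dataclass(frozen=True)
class Item:
    """Hypothetical input item: a position in the input grid plus a value."""
    coords: Tuple[float, ...]
    value: float


def in_range(coords: Sequence[float],
             query: Sequence[Tuple[float, float]]) -> bool:
    """True if every coordinate lies inside the [lo, hi] range for its dimension."""
    return all(lo <= c <= hi for c, (lo, hi) in zip(coords, query))


def process(items: Iterable[Item],
            query: Sequence[Tuple[float, float]],
            transform: Callable[[Item], Item],
            map_to_cell: Callable[[Item], Cell],
            aggregate: Callable[[float, float], float]) -> Dict[Cell, float]:
    """Range-select, transform, map and aggregate (serial, in-memory sketch)."""
    output: Dict[Cell, float] = {}
    for item in items:
        if not in_range(item.coords, query):
            continue                      # outside the region of interest
        t = transform(item)               # user-supplied pre-processing
        cell = map_to_cell(t)             # user-supplied input-to-output mapping
        if cell in output:
            output[cell] = aggregate(output[cell], t.value)
        else:
            output[cell] = t.value
    return output


# Toy data: (x, y) positions with a made-up measurement value.
readings = [
    Item((10.2, 20.7), 0.40),
    Item((10.8, 20.1), 0.90),
    Item((11.5, 20.3), 0.55),
    Item((50.0, 50.0), 0.99),  # falls outside the query ranges below
]
query = [(10.0, 12.0), (20.0, 21.0)]  # one (lo, hi) range per dimension

result = process(
    readings,
    query,
    transform=lambda it: Item(it.coords, it.value * 2.0),     # stand-in "correction"
    map_to_cell=lambda it: tuple(int(c) for c in it.coords),  # unit-sized grid cells
    aggregate=max,                                            # keep the largest value
)
```

Here `max` stands in for "select the clearest measurement"; an application that averages or composites its inputs would supply a different `aggregate` function (and typically carry extra state such as a count) without changing the surrounding loop.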