T2: a customizable parallel database for multi-dimensional data

Authors:
Chialin Chang;Anurag Acharya;Alan Sussman;Joel Saltz
Affiliations:
Dept. of Computer Science, University of Maryland, College Park, MD;Dept. of Computer Science, University of California, Santa Barbara, CA;Dept. of Computer Science, University of Maryland, College Park, MD;Dept. of Computer Science, University of Maryland, College Park, MD and Dept. of Pathology, Johns Hopkins Medical Institutions, Baltimore, MD
Venue:
ACM SIGMOD Record
Year:
1998

Cited by (16):

Efficient input and output for scientific simulations
Proceedings of the sixth workshop on I/O in parallel and distributed systems

Querying very large multi-dimensional datasets in ADR
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing

A hypergraph-partitioning approach for coarse-grain decomposition
Proceedings of the 2001 ACM/IEEE conference on Supercomputing

Data parallel language and compiler support for data intensive applications
Parallel Computing - Parallel data-intensive algorithms and applications

An efficient association mining implementation on clusters of SMP
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium

Compiling Data Intensive Applications with Spatial Coordinates
LCPC '00 Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing-Revised Papers

Query processing techniques for arrays
The VLDB Journal - The International Journal on Very Large Data Bases

Implementing data cube construction using a cluster middleware: algorithms, implementation experience, and performance evaluation
Future Generation Computer Systems - Selected papers from CCGRID 2002

References
Sourcebook of parallel computing

Middleware for data mining applications on clusters and grids
Journal of Parallel and Distributed Computing

Compiler and middleware support for scalable data mining
LCPC'01 Proceedings of the 14th international conference on Languages and compilers for parallel computing

Sidera: a cluster-based server for online analytical processing
OTM'07 Proceedings of the 2007 OTM confederated international conference on On the move to meaningful internet systems: CoopIS, DOA, ODBASE, GADA, and IS - Volume Part II

Hybrid merge/overlap execution technique for parallel array processing
Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases

ArrayStore: a storage manager for complex parallel array processing
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

The open connectome project data cluster: scalable analysis and vision for high-throughput neuroscience
Proceedings of the 25th International Conference on Scientific and Statistical Database Management

Astronomical data processing in EXTASCID
Proceedings of the 25th International Conference on Scientific and Statistical Database Management

Abstract

As computational power and storage capacity increase, processing and analyzing large volumes of data play an increasingly important part in many domains of scientific research. Typical examples of large scientific datasets include long running simulations of time-dependent phenomena that periodically generate snapshots of their state (e.g. hydrodynamics and chemical transport simulation for estimating pollution impact on water bodies [4, 6, 20], magnetohydrodynamics simulation of planetary magnetospheres [32], simulation of a flame sweeping through a volume [28], airplane wake simulations [21]), archives of raw and processed remote sensing data (e.g. AVHRR [25], Thematic Mapper [17], MODIS [22]), and archives of medical images (e.g. confocal light microscopy, CT imaging, MRI, sonography).

These datasets are usually multi-dimensional. The data dimensions can be spatial coordinates, time, or experimental conditions such as temperature, velocity or magnetic field. The importance of such datasets has been recognized by several database research groups and vendors, and several systems have been developed for managing and/or visualizing them [2, 7, 14, 19, 26, 27, 29, 31].

These systems, however, focus on lineage management, retrieval and visualization of multi-dimensional datasets. They provide little or no support for analyzing or processing these datasets -- the assumption is that this is too application-specific to warrant common support.
As a result, applications that process these datasets are usually decoupled from data storage and management, resulting in inefficiency due to copying and loss of locality. Furthermore, every application developer has to implement complex support for managing and scheduling the processing.

Over the past three years, we have been working with several scientific research groups to understand the processing requirements for such applications [1, 5, 6, 10, 18, 23, 24, 28]. Our study of a large set of applications indicates that the processing for such datasets is often highly stylized and shares several important characteristics. Usually, both the input dataset as well as the result being computed have underlying multi-dimensional grids, and queries into the dataset are in the form of ranges within each dimension of the grid. The basic processing step usually consists of transforming individual input items, mapping the transformed items to the output grid and computing output items by aggregating, in some way, all the transformed input items mapped to the corresponding grid point. For example, remote-sensing earth images are often generated by performing atmospheric correction on several days' worth of raw telemetry data, mapping all the data to a latitude-longitude grid and selecting those measurements that provide the clearest view.

In this paper, we present T2, a customizable parallel database that integrates storage, retrieval and processing of multi-dimensional datasets. T2 provides support for many operations including index generation, data retrieval, memory management, scheduling of processing across a parallel machine and user interaction. It achieves its primary advantage from the ability to seamlessly integrate data retrieval and processing for a wide variety of applications and from the ability to maintain and process multiple datasets with different underlying grids.
Most other systems for multi-dimensional data have focused on uniformly distributed datasets, such as images, maps, and dense multi-dimensional arrays. Many real datasets, however, are non-uniform or unstructured. For example, satellite data is a two-dimensional strip that is embedded in a three-dimensional space; water contamination studies use unstructured meshes to selectively simulate regions and so on. T2 can handle both uniform and non-uniform datasets.

T2 has been developed as a set of modular services. Since its structure mirrors that of a wide variety of applications, T2 is easy to customize for different types of processing. To build a version of T2 customized for a particular application, a user has to provide functions to pre-process the input data, map input data to elements in the output data, and aggregate multiple input data items that map to the same output element.

T2 presents a uniform interface to the end users (the clients of the database system). Users specify the dataset(s) of interest, a region of interest within the dataset(s), and the desired format and resolution of the output. In addition, they select the mapping and aggregation functions to be used. T2 analyzes the user request, builds a suitable plan to retrieve and process the datasets, executes the plan and presents the results in the desired format.

In Section 2 we first present several motivating applications and illustrate their common structure. Section 3 then presents an overview of T2, including its distinguishing features and a running example. Section 4 describes each database service in some detail. An example of how to customize several of the database services for a particular application is given in Section 5. T2 is a system in evolution. We conclude in Section 6 with a description of the current status of both the T2 design and the implementation of various applications with T2.
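
The processing structure described in the abstract (a range query, followed by transforming each input item, mapping it to the output grid, and aggregating the items that land on the same grid point) can be illustrated with a minimal sequential sketch. This is not T2's actual interface: the names `Reading`, `run_query`, `correct`, `to_cell`, and `clearest` are hypothetical, the "atmospheric correction" is a placeholder scaling, and the parallel scheduling, indexing, and memory management that T2 provides are not modeled. The example loosely follows the remote-sensing case, keeping the clearest measurement per latitude-longitude cell.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Tuple

Cell = Tuple[int, int]            # hypothetical output grid index (lat cell, lon cell)
Range = Tuple[float, float]       # half-open range [low, high)


@dataclass(frozen=True)
class Reading:
    """Hypothetical input item: one measurement at a point."""
    lat: float
    lon: float
    value: float
    cloudiness: float  # 0.0 = clear view, 1.0 = fully obscured


def run_query(
    items: Iterable[Reading],
    lat_range: Range,
    lon_range: Range,
    transform: Callable[[Reading], Reading],
    map_to_cell: Callable[[Reading], Cell],
    aggregate: Callable[[Optional[Reading], Reading], Reading],
) -> Dict[Cell, Reading]:
    """Select items inside the query ranges, then transform, map, and aggregate."""
    output: Dict[Cell, Reading] = {}
    for item in items:
        in_lat = lat_range[0] <= item.lat < lat_range[1]
        in_lon = lon_range[0] <= item.lon < lon_range[1]
        if not (in_lat and in_lon):
            continue                      # outside the region of interest
        transformed = transform(item)     # user-supplied pre-processing
        cell = map_to_cell(transformed)   # user-supplied input-to-output mapping
        output[cell] = aggregate(output.get(cell), transformed)
    return output


def correct(r: Reading) -> Reading:
    # Placeholder for atmospheric correction (assumption: a simple scaling).
    return Reading(r.lat, r.lon, r.value * 1.5, r.cloudiness)


def to_cell(r: Reading) -> Cell:
    # Assumption: a uniform 1-degree output grid.
    return (int(r.lat // 1.0), int(r.lon // 1.0))


def clearest(current: Optional[Reading], new: Reading) -> Reading:
    # Keep the measurement with the clearest view for each output cell.
    if current is None or new.cloudiness < current.cloudiness:
        return new
    return current


readings = [
    Reading(10.2, 20.7, 100.0, 0.8),
    Reading(10.9, 20.1, 120.0, 0.1),   # same cell as above, clearer view
    Reading(11.5, 20.5, 90.0, 0.3),
    Reading(50.0, 50.0, 70.0, 0.0),    # outside the query region
]
result = run_query(readings, (10.0, 12.0), (20.0, 21.0), correct, to_cell, clearest)
```

In this sketch the three user-supplied functions correspond to the customization points named in the abstract (pre-process, map, aggregate), while `run_query` stands in for the plan that the system would build and execute from the user's request.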