Processing large-scale multi-dimensional data in parallel and distributed environments

  • Authors:
  • Michael Beynon;Chialin Chang;Umit Catalyurek;Tahsin Kurc;Alan Sussman;Henrique Andrade;Renato Ferreira;Joel Saltz

  • Affiliations:
  • Department of Computer Science, University of Maryland, College Park, MD;Department of Computer Science, University of Maryland, College Park, MD;Department of Biomedical Informatics, The Ohio State University, Columbus, OH;Department of Biomedical Informatics, The Ohio State University, Columbus, OH;Department of Computer Science, University of Maryland, College Park, MD;Department of Computer Science, University of Maryland, College Park, MD;Department of Computer Science, University of Maryland, College Park, MD;Department of Biomedical Informatics, The Ohio State University, Columbus, OH

  • Venue:
  • Parallel Computing - Parallel data-intensive algorithms and applications
  • Year:
  • 2002

Abstract

Analysis of data is an important step in understanding and solving a scientific problem. Analysis involves extracting the data of interest from all the available raw data in a dataset and processing it into a data product. However, in many areas of science and engineering, a scientist's ability to analyze information is increasingly hindered by dataset sizes. The vast amount of data in scientific datasets makes it difficult to access the data of interest efficiently and to manage potentially heterogeneous system resources to process the data. Subsetting and aggregation are common operations executed in a wide range of data-intensive applications. We argue that common runtime and programming support can be developed for applications that query and manipulate large datasets. This paper presents a compendium of frameworks and methods we have developed to support efficient execution of subsetting and aggregation operations in applications that query and manipulate large, multi-dimensional datasets in parallel and distributed computing environments.
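
To make the two operations named in the abstract concrete, the following is a minimal sketch of subsetting (a range query over element coordinates) followed by aggregation (a per-cell maximum on a coarser output grid). It is not the paper's framework or API: the chunk layout and the names `subset`, `aggregate_max`, and `run_query` are hypothetical, chosen only for illustration, and the sketch runs sequentially on in-memory data.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical data layout, assumed for illustration only.
Point = Tuple[float, float]        # (x, y) coordinates of one data element
Element = Tuple[Point, float]      # coordinates plus an attribute value
Chunk = List[Element]              # elements stored and retrieved together


def subset(chunks: Iterable[Chunk], lo: Point, hi: Point) -> Iterator[Element]:
    """Subsetting: yield elements inside the query box [lo, hi] (inclusive)."""
    for chunk in chunks:
        for (x, y), value in chunk:
            if lo[0] <= x <= hi[0] and lo[1] <= y <= hi[1]:
                yield (x, y), value


def aggregate_max(elements: Iterable[Element], cell: float) -> Dict[Tuple[int, int], float]:
    """Aggregation: map each element to an output grid cell, keep the max per cell."""
    out: Dict[Tuple[int, int], float] = {}
    for (x, y), value in elements:
        key = (int(x // cell), int(y // cell))
        out[key] = max(out.get(key, value), value)
    return out


def run_query(chunks: Iterable[Chunk], lo: Point, hi: Point, cell: float) -> Dict[Tuple[int, int], float]:
    """Select the data of interest, then reduce it into a smaller data product."""
    return aggregate_max(subset(chunks, lo, hi), cell)


if __name__ == "__main__":
    chunks = [
        [((0.5, 0.5), 1.0), ((1.5, 0.5), 4.0)],
        [((0.7, 0.2), 3.0), ((9.0, 9.0), 100.0)],
    ]
    print(run_query(chunks, (0.0, 0.0), (2.0, 2.0), 1.0))
```

Because the maximum is commutative and associative, per-chunk partial results computed on different processors could in principle be merged in any order; that observation is an assumption of this sketch, not a description of the methods the paper presents.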