FREERIDE-G: Supporting Applications that Mine Remote

Authors:
Leonid Glimcher;Ruoming Jin;Gagan Agrawal
Affiliations:
Ohio State University, USA;Kent State University, USA;Ohio State University, USA
Venue:
ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
Year:
2006

Abstract

Analysis of large, geographically distributed scientific datasets, also referred to as distributed data-intensive science, has emerged as an important area in recent years. An application that processes data from a remote repository needs to be broken into several stages, including a data retrieval task at the data repository, a data movement task, and a data processing task at a computing site. Because of the volume of data involved and the amount of processing required, it is desirable for both the data repository and the computing site to be clusters. This can further complicate the development of such data processing applications. In this paper, we present a middleware, FREERIDE-G (FRamework for Rapid Implementation of Datamining Engines in Grid), which supports a high-level interface for developing data mining and scientific data processing applications that involve data stored in remote repositories. In particular, we had the following goals in designing the FREERIDE-G middleware: 1) support high-end processing, i.e., use parallel configurations for both hosting and processing the data, 2) ease the use of parallel configurations, i.e., support a high-level API for specifying the processing, and 3) hide the details of data movement and caching. We have evaluated our system using three popular data mining algorithms and two scientific data analysis applications. The main observations from our experiments are as follows. First, FREERIDE-G scales the processing extremely well when the numbers of data server and compute nodes are scaled evenly. Second, when only the number of compute nodes is scaled, our target class of applications achieves modest additional speedups. Finally, for applications that involve multiple passes over the dataset, caching remote data provides significant improvement.
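
The staged structure described in the abstract (retrieval at the repository, movement, processing at the compute site, with caching for multi-pass applications) can be illustrated with a minimal single-process sketch. This is not the FREERIDE-G API: the names REPOSITORY, ChunkSource, and run_pass are hypothetical, the "remote" repository is an in-memory dictionary, and data movement is simulated by a list copy. The example makes two passes over the data (mean, then variance) and counts how many chunks are actually fetched.

```python
from typing import Callable, Dict, Iterator, List, TypeVar

Acc = TypeVar("Acc")

# Hypothetical stand-in for a remote repository: chunk id -> data values.
REPOSITORY: Dict[int, List[float]] = {
    0: [1.0, 2.0],
    1: [3.0, 4.0],
    2: [5.0, 6.0],
}


class ChunkSource:
    """Serves chunks to the compute site, caching each one after its first fetch."""

    def __init__(self, repository: Dict[int, List[float]]) -> None:
        self.repository = repository
        self.cache: Dict[int, List[float]] = {}
        self.remote_fetches = 0  # number of simulated retrieval + movement steps

    def chunks(self) -> Iterator[List[float]]:
        for chunk_id in sorted(self.repository):
            if chunk_id not in self.cache:
                # Data retrieval and data movement, simulated by copying the list.
                self.cache[chunk_id] = list(self.repository[chunk_id])
                self.remote_fetches += 1
            yield self.cache[chunk_id]


def run_pass(source: ChunkSource,
             reduce_chunk: Callable[[Acc, List[float]], Acc],
             init: Acc) -> Acc:
    """Data processing stage: fold a user-supplied reduction over all chunks."""
    acc = init
    for chunk in source.chunks():
        acc = reduce_chunk(acc, chunk)
    return acc


source = ChunkSource(REPOSITORY)

# Pass 1: accumulate (sum, count) to obtain the mean.
total, count = run_pass(
    source,
    lambda acc, chunk: (acc[0] + sum(chunk), acc[1] + len(chunk)),
    (0.0, 0),
)
mean = total / count

# Pass 2: accumulate squared deviations; chunks now come from the cache.
squared_dev = run_pass(
    source,
    lambda acc, chunk: acc + sum((x - mean) ** 2 for x in chunk),
    0.0,
)
variance = squared_dev / count
```

After both passes, source.remote_fetches equals the number of chunks rather than twice that number, which is the effect the abstract attributes to caching remote data for multi-pass applications. A real deployment would distribute both the repository and the processing across cluster nodes, which this sketch does not attempt.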