XG: a data-driven computation grid for enterprise-scale mining

Authors:
Radu Sion;Ramesh Natarajan;Inderpal Narang;Wen-Syan Li;Thomas Phan
Affiliations:
Computer Sciences, Stony Brook University, Stony Brook, NY;IBM TJ Watson Research Lab, Yorktown Heights, NY;IBM Almaden Research Lab, San Jose, CA;IBM Almaden Research Lab, San Jose, CA;IBM Almaden Research Lab, San Jose, CA
Venue:
DEXA'05 Proceedings of the 16th international conference on Database and Expert Systems Applications
Year:
2005

Citing 2
Cited 3

Human Factors and Web Development

Explicit control in a batch-aware distributed file system

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1

A grid-based approach for enterprise-scale data mining

Future Generation Computer Systems - Special section: Data mining in grid computing environments
XG: a grid-enabled query processing engine

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology

Abstract

In this paper we introduce a novel architecture for data processing, based on a functional fusion between a data layer and a computation layer. We show how such an architecture can be leveraged to offer significant speedups for data processing jobs such as data analysis and mining over large data sets. One novel contribution of our solution is its data-driven approach: the computation infrastructure is controlled from within the data layer. Grid compute job submissions originate within the query processor on the DBMS side and are, in effect, controlled by the data processing job to be performed. This allows the early deployment of on-the-fly data aggregation techniques, minimizing the amount of data to be transferred to and from compute nodes, and is in stark contrast to existing Grid solutions that interact with data layers mainly as external “storage”. We validate this in a scenario derived from a real business deployment, involving financial customer profiling using common types of data analytics (e.g., linear regression analysis). Experimental results show significant speedups. For example, using a grid of only 12 non-dedicated nodes, we observed a speedup of approximately 1000% in a scenario involving complex linear regression analysis data mining computations for commercial customer profiling.
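The abstract does not describe the aggregation techniques in detail, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a simple one-variable linear regression in which each data partition is reduced to a small set of sufficient statistics (counts and sums), so that only these summaries, rather than the raw rows, would need to move between the data layer and the compute nodes. The names `RegressionStats`, `summarize`, and `fit` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class RegressionStats:
    """Sufficient statistics for least-squares fitting of y = a*x + b."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other: "RegressionStats") -> "RegressionStats":
        # Summaries of disjoint partitions combine by plain addition.
        return RegressionStats(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )


def summarize(rows: Iterable[Tuple[float, float]]) -> RegressionStats:
    """Reduce one partition of (x, y) rows to five numbers."""
    n, sx, sy, sxx, sxy = 0, 0.0, 0.0, 0.0, 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return RegressionStats(n, sx, sy, sxx, sxy)


def fit(stats: RegressionStats) -> Tuple[float, float]:
    """Return (slope, intercept) from merged sufficient statistics."""
    denom = stats.n * stats.sxx - stats.sx ** 2
    if denom == 0:
        raise ValueError("x values are constant or no data was provided")
    slope = (stats.n * stats.sxy - stats.sx * stats.sy) / denom
    intercept = (stats.sy - slope * stats.sx) / stats.n
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical data: y = 2x + 1, split into three partitions.
    data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
    partitions = [data[0:3], data[3:7], data[7:10]]

    total = RegressionStats()
    for part in partitions:
        total = total.merge(summarize(part))

    print(fit(total))
```

In this sketch each partition contributes five numbers regardless of how many rows it holds, which is one way early aggregation can shrink the data shipped per job. Whether XG aggregates in this particular form is an assumption.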