Optimal Grid Exploitation Algorithms for Data Mining

  • Authors:
  • Valerie Fiolet; Richard Olejnik; Guillem Lefait; Bernard Toursel

  • Affiliations:
  • University of Mons-Hainaut, Belgium / Laboratoire d'Informatique Fondamentale de Lille, France; Universite des Sciences et Technologies de Lille, France; Universite des Sciences et Technologies de Lille, France; Universite des Sciences et Technologies de Lille, France

  • Venue:
  • ISPDC '06 Proceedings of the Fifth International Symposium on Parallel and Distributed Computing
  • Year:
  • 2006

Abstract

Although many Data Mining tasks have been parallelized and can thus be executed on dedicated clusters, few solutions currently exist for solving Data Mining problems on a grid or a non-specialized network of workstations. The current trend is to use grids and/or desktop grids in order to exploit any available workstation, regardless of its physical location. Although a grid-specific algorithm shares some characteristics with a dedicated-cluster algorithm, the use of a grid brings many inherent constraints. In particular, resource volatility and communication costs reduce the effectiveness of parallelism. The DisDaMin project (DIStributed DAta MINing) revisits data mining tasks and proposes new algorithms that can be exploited on grids. The DisDaMin mechanisms first fragment the data using clustering methods, and then apply asynchronous collaborative techniques suited to the specifics of execution on grids. This fragmentation method makes it possible to carry out optimal local processing on each node with a minimum of communication. Building on it, we introduce the distributed algorithm DICCoop, an adaptation of DIC (see [3]). Simulations were run on the French national grid GRID5000 (part of the European CoreGrid) to demonstrate the efficiency of the proposed mechanisms. We analyse the impact of the numerous parameters on the optimization of parallel efficiency.
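
The fragment-then-process-locally idea in the abstract can be illustrated with a toy sketch. This is not DisDaMin, DIC, or DICCoop: the abstract does not specify those algorithms, so everything below (the Jaccard-based assignment to fixed seeds, the exhaustive itemset counting, the function names, and the sample data) is an assumption chosen only to show the general shape. Transactions are grouped into fragments by similarity, each fragment is counted independently as it would be on one node, and the per-fragment counts are merged in a single exchange.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two item sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def fragment(transactions, seeds):
    """Assign each transaction to its most similar seed.

    Hypothetical stand-in for the clustering step: a single pass
    against fixed seeds, not the clustering method used by the paper.
    """
    fragments = [[] for _ in seeds]
    for t in transactions:
        best = max(range(len(seeds)), key=lambda i: jaccard(t, seeds[i]))
        fragments[best].append(t)
    return fragments


def local_counts(fragment_data, max_size=2):
    """Count every itemset of up to max_size items inside one fragment.

    This is the work one node would do without talking to the others.
    """
    counts = Counter()
    for t in fragment_data:
        for k in range(1, max_size + 1):
            for itemset in combinations(sorted(t), k):
                counts[itemset] += 1
    return counts


def merge(all_counts, min_support):
    """Sum per-fragment counts and keep itemsets meeting the global threshold."""
    total = Counter()
    for c in all_counts:
        total.update(c)
    return {s: n for s, n in total.items() if n >= min_support}


# Illustrative data only.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"beer", "chips"}),
    frozenset({"beer", "chips", "salsa"}),
    frozenset({"bread", "eggs"}),
]
seeds = [frozenset({"bread", "milk"}), frozenset({"beer", "chips"})]

fragments = fragment(transactions, seeds)
frequent = merge([local_counts(f) for f in fragments], min_support=2)
```

Because similar transactions land in the same fragment, most itemsets in this sketch are counted entirely within one fragment, which is the intuition behind keeping communication low. A grid-ready algorithm would also have to handle node volatility and asynchronous exchange, which this sketch ignores.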