Dynamic failure management for parallel applications on grids

  • Authors:
  • Hyungsoo Jung;Dongin Shin;Hyeongseog Kim;Hyuck Han;Inseon Lee;Heon Y. Yeom

  • Affiliations:
  • School of Computer Science and Engineering, Institute of Computer Technology, Seoul National University, Seoul, Korea (all authors)

  • Venue:
  • EGC'05 Proceedings of the 2005 European conference on Advances in Grid Computing
  • Year:
  • 2005

Abstract

Today's computational grids are vulnerable to node failures, and the probability of a node failure grows rapidly as the size of the grid increases. Several attempts have been made to provide fault tolerance using checkpointing and message logging in conjunction with the MPI library. However, the grid itself should take an active role in dealing with failures. We propose a dynamically reconfigurable architecture in which applications can regroup in the face of a failure. The proposed architecture removes the single point of failure from computational grids and provides flexibility in grid configuration.