End-to-end framework for fault management for open source clusters: Ranger

  • Authors:
  • John L. Hammond; Tommy Minyard; Jim Browne

  • Affiliations:
  • ICES, University of Texas, Austin, Texas; TACC, University of Texas, Austin, Texas; University of Texas, Austin, Texas

  • Venue:
  • Proceedings of the 2010 TeraGrid Conference
  • Year:
  • 2010

Abstract

The scale and complexity of both the hardware and the software on large open source software systems such as Ranger make faults and failures inevitable. What is not inevitable is that they go undetected, or that diagnosis of and recovery from failures remain largely manual and effort-intensive. This paper presents a framework for end-to-end fault management that is being developed on Ranger but targets open source software based clusters in general. The framework has three elements: a rationalized system logging stack for Linux, low-overhead log and status monitoring, and a multilevel suite of diagnostic analyses. The paper describes the framework, reports the accomplishments to date and the results obtained with the elements already in place, and outlines plans for future development, including a solicitation for collaboration on the project.