A Gracefully Degrading Massively Parallel System Using the BSP Model, and Its Evaluation

  • Authors:
  • Andreas Savva; Takashi Nanya

  • Affiliations:
  • Fujitsu Ltd., Kawasaki, Japan; Univ. of Tokyo, Tokyo, Japan

  • Venue:
  • IEEE Transactions on Computers
  • Year:
  • 1999

Abstract

The Bulk-Synchronous Parallel (BSP) Model was proposed as a unifying model for parallel computation. By using Randomized Shared Memory (RSM), the model offers an asymptotically optimal emulation of the Parallel Random Access Machine (PRAM). By using the BSP model with RSM, we construct a gracefully degrading massively parallel system using a fault tolerance (FT) scheme that relies on memory duplication to ensure global memory integrity and to speed up the reconfiguration. After a fault occurs, global reconfiguration restores the logical properties of the system. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. We analyze, at the level of the BSP model, how the performance of a system may change as processors fail and the performance of the interconnection network degrades. We relate the change in overall system performance to the change in computation and communication load on the live processors. Further, we show how to estimate the overhead imposed by the FT scheme. We evaluate the reconfiguration time, the overhead, and graceful degradation of the system experimentally by an implementation on a Massively Parallel Processor (MPP). We show that the predictions about the degradation of the system and the overhead cost of the scheme are accurate.
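
As a rough illustration of the kind of BSP-level reasoning the abstract describes, the sketch below uses the standard BSP superstep cost, w + h·g + l (local work, plus h messages at gap g, plus barrier latency l), to estimate slowdown when some processors fail. It is not the paper's analysis: the function names, the assumption that work and messages spread evenly over the live processors, and the optional `g_after` parameter standing in for a degraded network are all hypothetical choices made for this example.

```python
def superstep_cost(w, h, g, l):
    """Standard BSP cost of one superstep: local work w, an h-relation
    routed at gap g per message, and barrier latency l."""
    return w + h * g + l


def per_processor_cost(total_work, total_msgs, live, g, l):
    """Cost of a superstep when load is shared among `live` processors.

    Assumption (for illustration only): both computation and
    communication are divided evenly among the live processors.
    """
    if live <= 0:
        raise ValueError("at least one live processor is required")
    w = total_work / live
    h = total_msgs / live
    return superstep_cost(w, h, g, l)


def slowdown(total_work, total_msgs, p, failed, g, l, g_after=None):
    """Ratio of superstep cost after `failed` of `p` processors are lost
    to the fault-free cost. `g_after` optionally models a slower network
    after the failure; by default g is assumed unchanged."""
    if g_after is None:
        g_after = g
    before = per_processor_cost(total_work, total_msgs, p, g, l)
    after = per_processor_cost(total_work, total_msgs, p - failed, g_after, l)
    return after / before


if __name__ == "__main__":
    # 64 processors, half of them fail, barrier latency ignored (l = 0).
    print(slowdown(6400, 640, p=64, failed=32, g=2, l=0))
```

With l = 0 and an unchanged g, losing half the processors doubles the per-superstep cost in this simplified model; a nonzero l, which does not scale with the load, makes the ratio smaller than 2.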