Analysis of a new intra-disk redundancy scheme for high-reliability RAID storage systems in the presence of unrecoverable errors

  • Authors:
  • Ajay Dholakia;Evangelos Eleftheriou;Xiao-Yu Hu;Ilias Iliadis;Jai Menon;KK Rao

  • Affiliations:
  • IBM Syst. and Tech. Group, Research Triangle Park, NC;IBM Zurich Research Lab, Rüschlikon, Switzerland;IBM Zurich Research Lab, Rüschlikon, Switzerland;IBM Zurich Research Lab, Rüschlikon, Switzerland;IBM Syst. and Tech. Group, San Jose, CA;IBM Almaden Research Center, San Jose, CA

  • Venue:
  • SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
  • Year:
  • 2006

Abstract

Today's data storage systems are increasingly adopting low-cost disk drives that have higher capacity but lower reliability, leading to more frequent rebuilds and to a higher risk of unrecoverable media errors. We propose a new XOR-based intra-disk redundancy scheme, called interleaved parity check (IPC), that enhances the reliability of RAID systems while incurring only negligible I/O performance degradation. The proposed scheme introduces an additional level of redundancy inside each disk, on top of the RAID redundancy across multiple disks. The RAID parity provides protection against disk failures, while the proposed scheme aims to protect against media-related unrecoverable errors.

We develop a new model capturing the effect of correlated unrecoverable sector errors and subsequently use it to analyze the proposed scheme as well as the traditional redundancy schemes based on Reed-Solomon (RS) codes and single-parity-check (SPC) codes. We derive closed-form expressions for the mean time to data loss (MTTDL) of RAID 5 and RAID 6 systems in the presence of unrecoverable errors and disk failures. We then combine these results for a comprehensive characterization of the reliability of RAID systems that incorporate the proposed IPC redundancy scheme. Our results show that in the practical case of correlated errors, the proposed scheme provides the same reliability as the optimum, albeit more complex, RS coding scheme. Finally, the throughput performance of incorporating the intra-disk redundancy on various RAID systems is evaluated by means of event-driven simulations. A detailed description of these contributions is given in [1].
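
The abstract does not spell out the IPC layout, so the following is only a minimal sketch of how an XOR-based interleaved parity inside a single disk might work. It assumes a segment of equal-size data sectors protected by m parity sectors, where parity j is the XOR of data sectors j, j+m, j+2m, and so on. The function names, the segment size, and the value of m are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length sectors."""
    return bytes(x ^ y for x, y in zip(a, b))


def interleaved_parity(sectors, m):
    """Compute m parity sectors for one segment (assumed layout).

    Parity j covers the data sectors at indices j, j+m, j+2m, ...
    """
    zero = bytes(len(sectors[0]))
    return [reduce(xor_bytes, sectors[j::m], zero) for j in range(m)]


def recover(sectors, parity, lost):
    """Rebuild the sectors at the indices in `lost`.

    Works only if no two lost sectors fall in the same interleave,
    which holds for any burst of at most m consecutive sectors.
    Entries of `sectors` at lost indices are never read.
    """
    m = len(parity)
    interleaves = [i % m for i in lost]
    if len(set(interleaves)) != len(interleaves):
        raise ValueError("two lost sectors share an interleave")
    repaired = list(sectors)
    for i in lost:
        j = i % m
        acc = parity[j]
        # XOR the parity with every surviving sector of interleave j.
        for k in range(j, len(sectors), m):
            if k != i:
                acc = xor_bytes(acc, sectors[k])
        repaired[i] = acc
    return repaired


# Hypothetical segment: 8 data sectors of 4 bytes each, m = 3 parity sectors.
data = [bytes([i] * 4) for i in range(8)]
parity = interleaved_parity(data, 3)

# Simulate a burst of 3 consecutive unreadable sectors.
damaged = list(data)
for i in (2, 3, 4):
    damaged[i] = None
restored = recover(damaged, parity, [2, 3, 4])
```

Under this assumed layout, a burst of up to m consecutive bad sectors touches each interleave at most once, which is why plain XOR parity is enough to repair it. Two errors in the same interleave are not correctable by this sketch; at the array level they would have to be handled by the RAID parity across disks.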