Increasing relevance of memory hardware errors: a case for recoverable programming models

  • Authors:
  • Dejan Milojicic;Alan Messer;James Shau;Guangrui Fu;Alberto Munoz

  • Affiliations:
  • HP Labs, MS 1U-18, Palo Alto, CA;HP Labs, MS 1U-18, Palo Alto, CA;HP Labs, MS 1U-18, Palo Alto, CA;HP Labs, MS 1U-18, Palo Alto, CA;HP Labs, MS 1U-18, Palo Alto, CA

  • Venue:
  • EW 9 Proceedings of the 9th workshop on ACM SIGOPS European workshop: beyond the PC: new challenges for the operating system
  • Year:
  • 2000

Abstract

It is commonly believed that most computer system failures today stem from programming errors. Computer systems are becoming more complex and harder to maintain and administer, which makes software errors even more common, while contemporary computer architectures are optimized for price and performance rather than availability. In this paper, we make a case for the increasing relevance of memory hardware soft errors. In particular, with the introduction of 64-bit processors, memory sizes scale up significantly, resulting in a higher probability of memory errors. At the same time, because computers are now used ubiquitously, including at higher altitudes, environmental conditions such as terrestrial cosmic rays affect error rates. Finally, in shared-memory systems, the failure of one node's memory can take the whole machine down. Current commodity systems do not tolerate memory errors, either in hardware (processors, memories, interconnects) or in software (operating systems, applications, application environments). At the same time, users expect increased reliability. We present the problems caused by such errors and some solutions for memory error recovery at the processor, operating system, and programming model levels.