Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults

  • Authors:
  • Vilas Sridharan; Jon Stearley; Nathan DeBardeleben; Sean Blanchard; Sudhanva Gurumurthi

  • Affiliations:
  • RAS Architecture, Advanced Micro Devices, Inc., Boxborough, MA; Scalable Architectures, Sandia National Laboratories, Albuquerque, New Mexico; Ultrascale Systems Research Center, Los Alamos National Laboratory, Los Alamos, New Mexico; Ultrascale Systems Research Center, Los Alamos National Laboratory, Los Alamos, New Mexico; Advanced Micro Devices, Inc., Boxborough, MA

  • Venue:
  • SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
  • Year:
  • 2013

Abstract

Several recent publications confirm that faults are common in high-performance computing systems. Therefore, further attention to the faults experienced by such computing systems is warranted. In this paper, we present a study of DRAM and SRAM faults in large high-performance computing systems. Our goal is to understand the factors that influence faults in production settings. We examine the impact of aging on DRAM, finding a marked shift from permanent to transient faults in the first two years of DRAM lifetime. We examine the impact of DRAM vendor, finding that fault rates vary by more than 4x among vendors. We examine the physical location of faults in a DRAM device and in a data center; contrary to prior studies, we find no correlations with either. Finally, we study the impact of altitude and rack placement on SRAM faults, finding that, as expected, altitude has a substantial impact on SRAM faults, and that top-of-rack placement correlates with a 20% higher fault rate.