Because a large fraction of node failures in a storage cluster do not actually destroy the data on disk, and some failed nodes recover quickly, deferring reconstruction for a certain time after a node failure, in the hope that the node recovers, can avoid unnecessary data rebuilding. Such a policy is feasible and attractive, but it also introduces a risk of data loss. In this paper, we present two delayed-reconstruction algorithms, static and dynamic, which differ in how the delay time is set. A qualitative approach is proposed to analyze the reliability, risk, and benefit of the two methods. Numerical results show that, under a certain distribution function of repair time for failed nodes, the static method has an optimal delay time that balances risk and benefit. Moreover, the dynamic method controls risk better than the static method, at the cost of a higher probability of launching reconstruction.
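To make the trade-off behind a static delay concrete, the following is a minimal sketch and not the paper's model. It assumes exponentially distributed repair times and independent, exponentially distributed node lifetimes; the function names and the parameters (`mean_repair_hours`, `mttf_hours`, `surviving_replicas`) are hypothetical and chosen only for illustration.

```python
import math

# Illustrative assumption: repair times are exponential with the given mean.
def recovery_probability(delay_hours, mean_repair_hours):
    """Probability that a failed node returns within the delay window,
    in which case reconstruction is avoided (the benefit)."""
    return 1.0 - math.exp(-delay_hours / mean_repair_hours)

# Illustrative assumption: surviving replicas fail independently,
# each with an exponential lifetime of mean mttf_hours.
def loss_risk(delay_hours, surviving_replicas, mttf_hours):
    """Probability that every surviving replica also fails during the
    delay window (the risk of deferring reconstruction)."""
    p_one = 1.0 - math.exp(-delay_hours / mttf_hours)
    return p_one ** surviving_replicas

def tradeoff_table(delays, mean_repair_hours=1.0,
                   surviving_replicas=2, mttf_hours=1000.0):
    """Return (delay, benefit, risk) tuples for a range of static delays."""
    return [
        (d,
         recovery_probability(d, mean_repair_hours),
         loss_risk(d, surviving_replicas, mttf_hours))
        for d in delays
    ]

if __name__ == "__main__":
    for delay, benefit, risk in tradeoff_table([0.5, 1, 2, 4, 8]):
        print(f"delay={delay:>4}h  benefit={benefit:.4f}  risk={risk:.3e}")
```

Under these assumptions both quantities grow with the delay, but the benefit saturates while the risk keeps rising, which is one way an optimal static delay can arise.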