Fault tolerance is rapidly becoming a crucial issue in high-end and distributed computing, as the increasing number of cores decreases the mean time to failure of these systems. In this work, we present an algorithm-based fault tolerance solution that handles fail-stop failures for a class of iterative data-intensive algorithms. We replicate the data so as to minimize data loss under multiple failures, and we reduce re-execution during recovery with only small modifications to the algorithms. We evaluate our approach using two data mining algorithms, k-means and Apriori. We show that our approach has negligible overhead and gracefully handles varying numbers of failures. In addition, our approach outperforms Hadoop both in the absence and in the presence of failures.
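The abstract does not spell out the replication scheme, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a simple round-robin placement in which each data chunk is copied to a fixed number of consecutive nodes; the function names (`place_replicas`, `assign_work`, `chunks_to_reexecute`) and the `copies` parameter are hypothetical and chosen for illustration. After a fail-stop failure, each chunk is handed to its first surviving holder, and only the chunks whose primary node failed need to be reprocessed.

```python
from typing import Dict, List, Set


def place_replicas(num_chunks: int, num_nodes: int, copies: int = 2) -> Dict[int, List[int]]:
    """Map each chunk id to the nodes holding a copy (primary first).

    Assumed placement (illustrative only): chunk c is stored on nodes
    c, c+1, ..., c+copies-1, all taken modulo num_nodes.
    """
    if not 1 <= copies <= num_nodes:
        raise ValueError("copies must be between 1 and num_nodes")
    return {
        c: [(c + r) % num_nodes for r in range(copies)]
        for c in range(num_chunks)
    }


def assign_work(placement: Dict[int, List[int]], failed: Set[int]) -> Dict[int, int]:
    """Choose, for every chunk, the first surviving holder as its processor."""
    owners: Dict[int, int] = {}
    for chunk, holders in placement.items():
        alive = [node for node in holders if node not in failed]
        if not alive:
            # Every copy of this chunk sat on a failed node: data is lost.
            raise RuntimeError(f"chunk {chunk} lost: all replicas failed")
        owners[chunk] = alive[0]
    return owners


def chunks_to_reexecute(placement: Dict[int, List[int]], failed: Set[int]) -> List[int]:
    """Return the chunks whose primary node failed; only these are redone."""
    return sorted(c for c, holders in placement.items() if holders[0] in failed)


if __name__ == "__main__":
    placement = place_replicas(num_chunks=8, num_nodes=4, copies=2)
    failed_nodes = {1}
    print(chunks_to_reexecute(placement, failed_nodes))  # [1, 5]
    print(assign_work(placement, failed_nodes))
```

With 8 chunks on 4 nodes and two copies each, losing node 1 forces re-execution of only chunks 1 and 5, which are picked up by node 2. Losing two adjacent nodes (for example 1 and 2) would destroy both copies of some chunks under this naive placement, which is the kind of multi-failure data loss a more careful replica layout aims to minimize.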