Implementing fault-tolerant services using the state machine approach: a tutorial
ACM Computing Surveys (CSUR)
ACM SIGOPS Operating Systems Review
The dangers of replication and a solution
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Practical Byzantine fault tolerance
OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
The Byzantine Generals Problem
ACM Transactions on Programming Languages and Systems (TOPLAS)
Byzantine generals in action: implementing fail-stop processors
ACM Transactions on Computer Systems (TOCS)
Transaction Processing: Concepts and Techniques
Transaction Processing: Concepts and Techniques
Fast Algorithms for Maintaining Replica Consistency in Lazy Master Replicated Databases
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
Understanding Replication in Databases and Distributed Systems
ICDCS '00 Proceedings of the The 20th International Conference on Distributed Computing Systems ( ICDCS 2000)
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Ganymed: scalable replication for transactional web applications
Proceedings of the 5th ACM/IFIP/USENIX international conference on Middleware
Systems research challenges: a scale-out perspective
IBM Journal of Research and Development
Chain replication for supporting high throughput and availability
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Bigtable: a distributed storage system for structured data
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Sprint: a middleware for high-performance transaction processing
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Dynamo: amazon's highly available key-value store
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Proceedings of the ACM/IFIP/USENIX 2003 International Conference on Middleware
Parallel programming framework for large batch transaction processing on scale-out systems
Proceedings of the 3rd Annual Haifa Experimental Systems Conference
Hi-index | 0.00 |
There is a growing need for efficient distributed computing for transaction processing. One of the key requirements for runtime systems in distributed environments is fault tolerance. Such a system needs to preserve the data consistency at transaction boundaries so as to resume the ongoing tasks from checkpoints with consistent data for any component failure. Another key requirement is that the system needs to be lightweight enough in normal execution to provide scalable performance. This paper presents the design and implementation of a new fault tolerant execution framework that addresses both of these requirements. We replicate each partition of the distributed persistent data on three nodes (triplet) with two different types of backups, one using warm replication and the other using cold replication. For node failures, the system is automatically recoverable unless all three nodes in any triplet fail at the same time. The system tolerates simultaneous two-node failures in any triplet most of the cases. We obtained a new trade-off in that 43% performance improvements can be achieved by slightly compromising the system availability.