The idle computers on a local-area, campus-area, or even wide-area network represent a significant computational resource, but one that is also unreliable, heterogeneous, and opportunistic. We describe an algorithm that allows branch-and-bound problems to be solved in such environments. In designing this algorithm, we faced two challenges: (1) scalability, to exploit effectively the variably sized pools of resources available, and (2) fault tolerance, to ensure the reliability of services. We achieve scalability through a fully decentralized algorithm in which the dynamically available resources are managed through a membership protocol. We guarantee fault tolerance in the sense that the loss of all but one resource will not affect the quality of the solution. To propagate information reliably, we use epidemic communication for both the membership protocol and the fault-tolerance mechanism. We have developed a simulation framework that allows us to evaluate design alternatives. Results obtained in this framework suggest that our techniques can execute scalably and reliably.
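To give a feel for the epidemic communication mentioned above, the following is a minimal sketch of generic push-style gossip, not the protocol used in this work. The function name `push_gossip_rounds`, its parameters, and the uniform random peer selection are all assumptions made for illustration. Each informed node forwards an update to one randomly chosen peer per round, and the simulation counts how many rounds pass before every node has the update.

```python
import random


def push_gossip_rounds(n, seed=0, max_rounds=1000):
    """Simulate push-style epidemic dissemination among n nodes.

    Hypothetical illustration: node 0 starts with the update. In each
    round, every informed node picks one *other* node uniformly at
    random and forwards the update to it. Returns the number of rounds
    until all n nodes are informed.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)  # seeded so a run is reproducible
    informed = {0}
    rounds = 0
    while len(informed) < n:
        if rounds >= max_rounds:
            raise RuntimeError("update did not reach all nodes")
        newly_informed = set()
        for node in informed:
            # Draw from the n - 1 other nodes, skipping the sender itself.
            peer = rng.randrange(n - 1)
            if peer >= node:
                peer += 1
            newly_informed.add(peer)
        informed |= newly_informed
        rounds += 1
    return rounds


if __name__ == "__main__":
    for size in (8, 64, 512):
        print(size, push_gossip_rounds(size, seed=1))
```

Because the informed set can at most double in a round, reaching 64 nodes takes at least 6 rounds in this model. The sketch ignores message loss, node failures, and changing membership, all of which a real deployment on opportunistic resources would have to handle.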