Grids use a form of distributed computing to tackle the complex computational and data processing problems that scientists face today. When designing an (optical) network supporting grids, it is essential that the network can survive single network failures, for which several protection schemes have been devised in the past. In this work, we extend the existing Shared Path protection scheme by incorporating the anycast principle typical of grids: a user typically does not care on which specific server a job is executed and is merely interested in the timely delivery of its results. Therefore, in contrast with Classical Shared Path protection (CSP), we do not necessarily provide a backup path between the source and the original destination. Instead, we allow the job to be relocated to another server location if doing so yields a backup path that uses fewer wavelengths than the one CSP would suggest. We assess the bandwidth savings enabled by relocation in a quantitative dimensioning case study on a European and an American network topology, which shows substantial savings in the number of required wavelengths (on the order of 11-50%, depending on network topology and server locations). We also investigate how relocation affects the computational load on the execution servers. The case study is based on solving a grid network dimensioning problem: we present Integer Linear Programming (ILP) formulations for both the traditional CSP and the new resilience scheme exploiting relocation (SPR). We also outline a strategy to deal with the anycast principle: assuming we are given just the origins and intensity of job arrivals, we derive a static (source, destination)-based demand matrix. The latter is then used as input to solve the network dimensioning ILP for an optical circuit-switched WDM network.
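The relocation idea can be illustrated with a small Python sketch. It is not the ILP formulation from the paper. It is a greedy, hop-count-based toy that assumes single link failures, link-disjoint backup paths, and no sharing of backup capacity. The topology, the node names, and the function names (`shortest_path`, `backup_csp`, `backup_spr`) are hypothetical and chosen only for illustration.

```python
from collections import deque

def shortest_path(adj, src, dst, banned=frozenset()):
    """BFS shortest path by hop count, skipping any link listed in `banned`."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev and frozenset((u, v)) not in banned:
                prev[v] = u
                queue.append(v)
    return None  # no path avoids the banned links

def links(path):
    """Undirected links traversed by a path."""
    return {frozenset(edge) for edge in zip(path, path[1:])}

def backup_csp(adj, src, dst):
    """CSP-style choice: the backup path must end at the original destination."""
    primary = shortest_path(adj, src, dst)
    return primary, shortest_path(adj, src, dst, links(primary))

def backup_spr(adj, src, dst, servers):
    """Relocation-style choice: the backup path may end at any server site."""
    primary = shortest_path(adj, src, dst)
    candidates = [shortest_path(adj, src, s, links(primary)) for s in servers]
    candidates = [p for p in candidates if p is not None]
    return primary, min(candidates, key=len, default=None)

# Hypothetical toy topology: source A, server sites S1 and S2.
adj = {
    "A": ["S1", "B", "D"],
    "B": ["A", "C"],
    "C": ["B", "S1"],
    "D": ["A", "S2"],
    "S1": ["A", "C"],
    "S2": ["D"],
}

_, csp_backup = backup_csp(adj, "A", "S1")
_, spr_backup = backup_spr(adj, "A", "S1", ["S1", "S2"])
print("CSP backup:", csp_backup, "hops:", len(csp_backup) - 1)
print("SPR backup:", spr_backup, "hops:", len(spr_backup) - 1)
```

In this toy case the backup path to the original server S1 needs three hops, while relocating the job to S2 needs only two. An actual dimensioning study would instead optimize all demands jointly and account for backup wavelengths shared among paths, as the ILP formulations in the paper do.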