Exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today's machines. Systems software for exascale machines must provide the infrastructure to support existing applications while also enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis in a highly unreliable hardware environment with billions of threads of execution. Further, these systems must be designed with failure in mind. FOX is a new system for exascale machines that will support distributed data objects as first-class objects in the operating system itself. This memory-based data store will be named and accessed as part of the application's file system namespace. Many types of objects can be built on this data store, including data-driven work queues, which will in turn support applications with inherent resilience.
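To make the idea of a data-driven work queue living in a file system namespace more concrete, here is a minimal sketch. It is not FOX's interface, which the abstract does not describe. It assumes an ordinary local directory as a stand-in for the memory-based data store, and the class `DirWorkQueue`, its methods, and the `todo`/`claimed`/`done` layout are all hypothetical names chosen for illustration. A worker claims a task by renaming its file, and tasks held by a failed worker are renamed back so that another worker can pick them up.

```python
import os
import tempfile


class DirWorkQueue:
    """Hypothetical work queue whose tasks are files in a directory tree."""

    def __init__(self, root):
        self.todo = os.path.join(root, "todo")
        self.claimed = os.path.join(root, "claimed")
        self.done = os.path.join(root, "done")
        for d in (self.todo, self.claimed, self.done):
            os.makedirs(d, exist_ok=True)

    def put(self, name, payload):
        # Write under a hidden temporary name, then rename, so that
        # workers never observe a partially written task.
        tmp = os.path.join(self.todo, "." + name + ".tmp")
        with open(tmp, "w") as f:
            f.write(payload)
        os.rename(tmp, os.path.join(self.todo, name))

    def claim(self, worker):
        # Claim a task by renaming it into the claimed directory.
        # If another worker renamed it first, the source is gone and
        # we move on to the next candidate.
        for name in sorted(os.listdir(self.todo)):
            if name.startswith("."):
                continue  # skip tasks still being written
            src = os.path.join(self.todo, name)
            dst = os.path.join(self.claimed, worker + "--" + name)
            try:
                os.rename(src, dst)
            except FileNotFoundError:
                continue
            with open(dst) as f:
                return name, f.read()
        return None

    def complete(self, worker, name):
        # Mark a claimed task as finished.
        src = os.path.join(self.claimed, worker + "--" + name)
        os.rename(src, os.path.join(self.done, name))

    def recover(self, failed_worker):
        # Return every task held by a failed worker to the todo directory.
        prefix = failed_worker + "--"
        requeued = []
        for entry in sorted(os.listdir(self.claimed)):
            if entry.startswith(prefix):
                name = entry[len(prefix):]
                os.rename(os.path.join(self.claimed, entry),
                          os.path.join(self.todo, name))
                requeued.append(name)
        return requeued


def demo():
    with tempfile.TemporaryDirectory() as root:
        q = DirWorkQueue(root)
        q.put("task-0", "block 0")
        q.put("task-1", "block 1")
        first = q.claim("w1")      # w1 takes a task ...
        q.recover("w1")            # ... then fails; its task is requeued
        a = q.claim("w2")
        q.complete("w2", a[0])
        b = q.claim("w2")
        q.complete("w2", b[0])
        return first, sorted(os.listdir(q.done)), q.claim("w2")
```

In this sketch the queue state is just names in a directory, so no task is lost when a worker disappears: the surviving worker finishes both tasks. How failures are detected, and how the store itself is replicated across nodes, are outside what the sketch covers.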