Distributed snapshots: determining global states of distributed systems

Authors:
K. Mani Chandy; Leslie Lamport
Affiliations:
Department of Computer Sciences, University of Texas at Austin, Austin, TX; Stanford Research Institute, Menlo Park, CA
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
1985

Citing 5

Distributed deadlock detection algorithm

ACM Transactions on Database Systems (TODS)
Termination Detection of Diffusing Computations in Communicating Sequential Processes

ACM Transactions on Programming Languages and Systems (TOPLAS)
Distributed deadlock detection

ACM Transactions on Computer Systems (TOCS)
Distributed computation on graphs: shortest path algorithms

Communications of the ACM
Time, clocks, and the ordering of events in a distributed system

Communications of the ACM

Cited 549

Optimistic recovery in distributed systems

ACM Transactions on Computer Systems (TOCS)
Virtual time

ACM Transactions on Programming Languages and Systems (TOPLAS)
An example of stepwise refinement of distributed programs: quiescence detection

ACM Transactions on Programming Languages and Systems (TOPLAS) - The MIT Press scientific computation series
Highly available distributed services and fault-tolerant distributed garbage collection

PODC '86 Proceedings of the fifth annual ACM symposium on Principles of distributed computing
Debugging Parallel Programs with Instant Replay

IEEE Transactions on Computers
PARIS: a system for reusing partially interpreted schemas

ICSE '87 Proceedings of the 9th international conference on Software Engineering
Epidemic algorithms for replicated database maintenance

PODC '87 Proceedings of the sixth annual ACM Symposium on Principles of distributed computing
Detecting global termination conditions in the face of uncertainty

PODC '87 Proceedings of the sixth annual ACM Symposium on Principles of distributed computing
Detection of stable properties in distributed applications

PODC '87 Proceedings of the sixth annual ACM Symposium on Principles of distributed computing
Interleaving set temporal logic

PODC '87 Proceedings of the sixth annual ACM Symposium on Principles of distributed computing
Substituting for real time and common knowledge in asynchronous distributed systems

PODC '87 Proceedings of the sixth annual ACM Symposium on Principles of distributed computing
Epidemic algorithms for replicated database maintenance

ACM SIGOPS Operating Systems Review
Deadlock detection in distributed databases

ACM Computing Surveys (CSUR)
Semantics based transaction management techniques for replicated data

SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
Debugging concurrent processes: a case study

PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Monitoring and performance measuring distributed systems during operation

SIGMETRICS '88 Proceedings of the 1988 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Toward a non-atomic era: l-exclusion as a test case

STOC '88 Proceedings of the twentieth annual ACM symposium on Theory of computing
Understanding and verifying distributed algorithms using stratified decomposition

PODC '88 Proceedings of the seventh annual ACM Symposium on Principles of distributed computing
The power of multimedia: combining point-to-point and multi-access networks

PODC '88 Proceedings of the seventh annual ACM Symposium on Principles of distributed computing
Recovery in distributed systems using asynchronous message logging and checkpointing

PODC '88 Proceedings of the seventh annual ACM Symposium on Principles of distributed computing
Concurrent common knowledge: a new definition of agreement for asynchronous systems

PODC '88 Proceedings of the seventh annual ACM Symposium on Principles of distributed computing
Detecting stable properties of networks in concurrent logic programming languages

PODC '88 Proceedings of the seventh annual ACM Symposium on Principles of distributed computing
On achieving consensus using a shared memory

PODC '88 Proceedings of the seventh annual ACM Symposium on Principles of distributed computing
Reliability mechanisms for ADAMS

C3P Proceedings of the third conference on Hypercube concurrent computers and applications - Volume 2
A distributed debugger for Amoeba

PADD '88 Proceedings of the 1988 ACM SIGPLAN and SIGOPS workshop on Parallel and distributed debugging
A graphical representation of concurrent processes

PADD '88 Proceedings of the 1988 ACM SIGPLAN and SIGOPS workshop on Parallel and distributed debugging
The family of concurrent logic programming languages

ACM Computing Surveys (CSUR)
Efficient distributed recovery using message logging

Proceedings of the eighth annual ACM Symposium on Principles of distributed computing
A compositional approach to superimposition

POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Declarative visualization in the shared dataspace paradigm

ICSE '89 Proceedings of the 11th international conference on Software engineering
Debugging concurrent programs

ACM Computing Surveys (CSUR)
Distributed Checkpointing for Globally Consistent States of Databases

IEEE Transactions on Software Engineering
Knowledge and common knowledge in a distributed environment

Journal of the ACM (JACM)
Fault-tolerant computing based on Mach

ACM SIGOPS Operating Systems Review
Atomic snapshots of shared memory

PODC '90 Proceedings of the ninth annual ACM symposium on Principles of distributed computing
The inhibition spectrum and the achievement of causal consistency

PODC '90 Proceedings of the ninth annual ACM symposium on Principles of distributed computing
Self-stabilizing extensions for message-passing systems

PODC '90 Proceedings of the ninth annual ACM symposium on Principles of distributed computing
Mixed Programming Metaphors in a Shared Dataspace Model of Concurrency

IEEE Transactions on Software Engineering
The use of a synchronizer yields maximum computation rate in distributed networks

STOC '90 Proceedings of the twenty-second annual ACM symposium on Theory of computing
Paradigms for process interaction in distributed programs

ACM Computing Surveys (CSUR)
Replay, recovery, replication, and snapshots of nondeterministic concurrent programs

PODC '91 Proceedings of the tenth annual ACM symposium on Principles of distributed computing
Transparent optimistic rollback recovery

ACM SIGOPS Operating Systems Review
Restoring consistent global states of distributed computations

PADD '91 Proceedings of the 1991 ACM/ONR workshop on Parallel and distributed debugging
An approach to reducing delays in recognizing distributed event occurrences

PADD '91 Proceedings of the 1991 ACM/ONR workshop on Parallel and distributed debugging
Consistent detection of global predicates

PADD '91 Proceedings of the 1991 ACM/ONR workshop on Parallel and distributed debugging
Elements for a course on the design of distributed algorithms

ACM SIGCSE Bulletin
The slide mechanism with applications in dynamic networks

PODC '92 Proceedings of the eleventh annual ACM symposium on Principles of distributed computing
An abstract model of rollback recovery control in distributed systems

ACM SIGOPS Operating Systems Review
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit

IEEE Transactions on Computers - Special issue on fault-tolerant computing
Self-stabilization

ACM Computing Surveys (CSUR)
Simulating synchronized clocks and common knowledge in distributed systems

Journal of the ACM (JACM)
The derivation of distributed termination detection algorithms from garbage collection schemes

ACM Transactions on Programming Languages and Systems (TOPLAS)
Atomic snapshots of shared memory

Journal of the ACM (JACM)
Causal controversy at Le Mont St.-Michel

ACM SIGOPS Operating Systems Review
Making parallel simulations go fast

WSC '92 Proceedings of the 24th conference on Winter simulation
A superimposition control construct for distributed systems

ACM Transactions on Programming Languages and Systems (TOPLAS)
Adaptive message logging for incremental replay of message-passing programs

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Detecting relational global predicates in distributed systems

PADD '93 Proceedings of the 1993 ACM/ONR workshop on Parallel and distributed debugging
Detecting atomic sequences of predicates in distributed computations

PADD '93 Proceedings of the 1993 ACM/ONR workshop on Parallel and distributed debugging
The pessimism behind optimistic simulation

PADS '94 Proceedings of the eighth workshop on Parallel and distributed simulation
Reliable and efficient hop-by-hop flow control

SIGCOMM '94 Proceedings of the conference on Communications architectures, protocols and applications
A distributed garbage collector for active objects

OOPSLA '94 Proceedings of the ninth annual conference on Object-oriented programming systems, languages, and applications
ENF event predicate detection in distributed systems

PODC '94 Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing
A checkpoint protocol for an entry consistent shared memory system

PODC '94 Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing
Self-stabilization by counter flushing

PODC '94 Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing
Memory-efficient and self-stabilizing network RESET (extended abstract)

PODC '94 Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing
Uniform actions in asynchronous distributed systems

PODC '94 Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing
On the memory overhead of distributed snapshots

PODC '94 Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing
Local and temporal predicates in distributed systems

ACM Transactions on Programming Languages and Systems (TOPLAS)
An (N-1)-Resilient Algorithm for Distributed Termination Detection

IEEE Transactions on Parallel and Distributed Systems
Concurrent and Distributed Garbage Collection of Active Objects

IEEE Transactions on Parallel and Distributed Systems
Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems

IEEE Transactions on Parallel and Distributed Systems
Testing and Debugging Distributed Programs Using Global Predicates

IEEE Transactions on Software Engineering
Online tracking of mobile users

Journal of the ACM (JACM)
Detection and resolution of deadlocks in distributed database systems

CIKM '95 Proceedings of the fourth international conference on Information and knowledge management
A case for two-level distributed recovery schemes

Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
On distributed object checkpointing and recovery

Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing
On the relevance of communication costs of rollback-recovery protocols

Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing
Reasoning about meta level activities in open distributed systems

Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing
Finite termination of asynchronous iterative algorithms

Parallel Computing
Indirect distributed garbage collection: handling object migration

ACM Transactions on Programming Languages and Systems (TOPLAS)
An online computation of critical path profiling

SPDT '96 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
Debugging race conditions in message-passing programs

SPDT '96 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems

IEEE Transactions on Parallel and Distributed Systems
Adaptive recovery for mobile environments

Communications of the ACM
An Architecture for Tolerating Processor Failures in Shared-Memory Multiprocessors

IEEE Transactions on Computers
Detection of Strong Unstable Predicates in Distributed Programs

IEEE Transactions on Parallel and Distributed Systems
Trade-offs in implementing causal message logging protocols

PODC '96 Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing
A reliable and scalable striping protocol

Conference proceedings on Applications, technologies, architectures, and protocols for computer communications
Optimistic Crash Recovery without Changing Application Messages

IEEE Transactions on Parallel and Distributed Systems
Group membership and view synchrony in partitionable asynchronous distributed systems: specifications

ACM SIGOPS Operating Systems Review
Distributed termination detection for dynamic systems

Parallel Computing
Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints

IEEE Transactions on Computers
Distributed deadlock detection in Ada run-time environments

TRI-Ada '90 Proceedings of the conference on TRI-ADA '90
An algorithm for message delivery to mobile units

PODC '97 Proceedings of the sixteenth annual ACM symposium on Principles of distributed computing
A Survey of Distributed Database Checkpointing

Distributed and Parallel Databases
An effective garbage collection strategy for parallel programming languages on large scale distributed-memory machines

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Protocols for Integrity Constraint Checking in Federated Databases

Distributed and Parallel Databases
A Survey of Recoverable Distributed Shared Virtual Memory Systems

IEEE Transactions on Parallel and Distributed Systems
Progressive Retry for Software Failure Recovery in Message-Passing Applications

IEEE Transactions on Computers
Efficient transparent application recovery in client-server information systems

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
A geographically distributed framework for embedded system design and validation

DAC '98 Proceedings of the 35th annual Design Automation Conference
Persistent messages in local transactions

PODC '98 Proceedings of the seventeenth annual ACM symposium on Principles of distributed computing
Fault-tolerant distributed simulation

PADS '98 Proceedings of the twelfth workshop on Parallel and distributed simulation
A Case for Two-Level Recovery Schemes

IEEE Transactions on Computers
Webs of Archived Distributed Computations for Asynchronous Collaboration

The Journal of Supercomputing - Special issue: high performance distributed computing
Efficient and flexible fault tolerance and migration of scientific simulations using CUMULVS

SPDT '98 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
Theoretical Analysis for Communication-Induced Checkpointing Protocols with Rollback-Dependency Trackability

IEEE Transactions on Parallel and Distributed Systems
Critical Path Profiling of Message Passing and Shared-Memory Programs

IEEE Transactions on Parallel and Distributed Systems
On Coordinated Checkpointing in Distributed Systems

IEEE Transactions on Parallel and Distributed Systems
An Index-Based Checkpointing Algorithm for Autonomous Distributed Systems

IEEE Transactions on Parallel and Distributed Systems
Transparent adaptive parallelism on NOWs using OpenMP

Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Rollback-dependency trackability: visible characterizations

Proceedings of the eighteenth annual ACM symposium on Principles of distributed computing
Optimism: not just for event execution anymore

PADS '99 Proceedings of the thirteenth workshop on Parallel and distributed simulation
SFT: a consistent checkpointing algorithm with shorter freezing time

ACM SIGOPS Operating Systems Review
Algorithm development in the mobile environment

Proceedings of the 21st international conference on Software engineering
Fault-tolerant distributed simulation

WSC '91 Proceedings of the 23rd conference on Winter simulation
Learning to Improve Coordinated Actions in Cooperative Distributed Problem-Solving Environments

Machine Learning
Event-Based Techniques to Debug an Object Request Broker

The Journal of Supercomputing
Staggered Consistent Checkpointing

IEEE Transactions on Parallel and Distributed Systems
Communication-Induced Determination of Consistent Snapshots

IEEE Transactions on Parallel and Distributed Systems
A module on distributed systems for the operating systems course

SIGCSE '90 Proceedings of the twenty-first SIGCSE technical symposium on Computer science education
Checkpointing and rollback-recovery for distributed systems

ACM '86 Proceedings of 1986 ACM Fall joint computer conference
An architecture for packet-striping protocols

ACM Transactions on Computer Systems (TOCS)
Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations

The Journal of Supercomputing
A Low Overhead Logging Scheme for Fast Recovery in Distributed Shared Memory Systems

The Journal of Supercomputing
Debugging distributed programs using controlled re-execution

Proceedings of the nineteenth annual ACM symposium on Principles of distributed computing
Resettable vector clocks

Proceedings of the nineteenth annual ACM symposium on Principles of distributed computing
Mutable Checkpoints: A New Checkpointing Approach for Mobile Computing Systems

IEEE Transactions on Parallel and Distributed Systems
Increasing the confidence in off-the-shelf components: a software connector-based approach

SSR '01 Proceedings of the 2001 symposium on Software reusability: putting software reuse in context
The concurrency hierarchy, and algorithms for unbounded concurrency

Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
Techniques to Tackle State Explosion in Global Predicate Detection

IEEE Transactions on Software Engineering
Transparent optimistic rollback recovery

EW 4 Proceedings of the 4th workshop on ACM SIGOPS European workshop
Causality in distributed systems

EW 5 Proceedings of the 5th workshop on ACM SIGOPS European workshop: Models and paradigms for distributed systems structuring
Distributed Predicate Detection in Series-Parallel Systems

IEEE Transactions on Parallel and Distributed Systems
Highly efficient gang scheduling implementation

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
A checkpoint-based high availability run-time system for Windows NT clusters

ACM SIGOPS Operating Systems Review
Using passive object garbage collection algorithms for garbage collection of active objects

Proceedings of the 3rd international symposium on Memory management
A Formal Specification and Verification Framework for Time Warp-Based Parallel Simulation

IEEE Transactions on Software Engineering
Tracking immediate predecessors in distributed computations

Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures
A Roll-Forward Recovery Scheme for Solving the Problem of Coasting Forward for Distributed Systems

ACM SIGOPS Operating Systems Review
Logical Clock Requirements for Reverse Engineering Scenarios from a Distributed System

IEEE Transactions on Software Engineering
A Distributed Parallel Programming Framework

IEEE Transactions on Software Engineering
A survey of rollback-recovery protocols in message-passing systems

ACM Computing Surveys (CSUR)
Reliable network connections

Proceedings of the 8th annual international conference on Mobile computing and networking
On-the-fly calculation and verification of consistent steering transactions

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Undo as concurrent inverse in group editors

ACM Transactions on Computer-Human Interaction (TOCHI)
Efficient Garbage Collection Schemes for Causal Message Logging with Independent Checkpointing

The Journal of Supercomputing
Triggered message sequence charts

Proceedings of the 10th ACM SIGSOFT symposium on Foundations of software engineering
Concurrent single stepping in event-visualization tools

Cluster Computing
Optimal Distributed Arc-Consistency

Constraints
Triggered message sequence charts

ACM SIGSOFT Software Engineering Notes
Current Approaches for Solving Over-Constrained Problems

Constraints
Adaptive Message Logging for Incremental Program Replay

IEEE Parallel & Distributed Technology: Systems & Applications
Bounded and Minimum Global Snapshots

IEEE Parallel & Distributed Technology: Systems & Applications
ickp: A Consistent Checkpointer for Multicomputers

IEEE Parallel & Distributed Technology: Systems & Applications
Methods for Observing Global Properties in Distributed Systems

IEEE Parallel & Distributed Technology: Systems & Applications
A Framework for Distributed Debugging

IEEE Software
Reliability Through Consistency

IEEE Software
Nest: A Nested-Predicate Scheme for Fault Tolerance

IEEE Transactions on Computers
Distributed Reset

IEEE Transactions on Computers
An Adaptive Checkpointing Scheme for Distributed Databases with Mixed Types of Transactions

IEEE Transactions on Knowledge and Data Engineering
Development of a Class of Distributed Termination Detection Algorithms

IEEE Transactions on Knowledge and Data Engineering
The Distributed Constraint Satisfaction Problem: Formalization and Algorithms

IEEE Transactions on Knowledge and Data Engineering
Rollback Recovery in Distributed Systems Using Loosely Synchronized Clocks

IEEE Transactions on Parallel and Distributed Systems
Checkpointing for Distributed Databases: Starting from the Basics

IEEE Transactions on Parallel and Distributed Systems
An Implementation of F-Channels

IEEE Transactions on Parallel and Distributed Systems
An Efficient Protocol for Checkpointing Recovery in Distributed Systems

IEEE Transactions on Parallel and Distributed Systems
Detection of Weak Unstable Predicates in Distributed Programs

IEEE Transactions on Parallel and Distributed Systems
Repeated Computation of Global Functions in a Distributed Environment

IEEE Transactions on Parallel and Distributed Systems
Low-Latency, Concurrent Checkpointing for Parallel Programs

IEEE Transactions on Parallel and Distributed Systems
On the Performance of Synchronized Programs in Distributed Networks with Random Processing Times and Transmission Delays

IEEE Transactions on Parallel and Distributed Systems
Efficient Rollback-Recovery Technique in Distributed Computing Systems

IEEE Transactions on Parallel and Distributed Systems
Finding Consistent Global Checkpoints in a Distributed Computation

IEEE Transactions on Parallel and Distributed Systems
Proof Rules for Flush Channels

IEEE Transactions on Software Engineering
Passive-Space and Time View: Vector Clocks for Achieving Higher Performance, Program Correction, and Distributed Computing

IEEE Transactions on Software Engineering
Efficient Detection and Resolution of Generalized Distributed Deadlocks

IEEE Transactions on Software Engineering
Consistency Issues in Distributed Checkpoints

IEEE Transactions on Software Engineering
An Efficient Distributed Online Algorithm to Detect Strong Conjunctive Predicates

IEEE Transactions on Software Engineering
Checkpointing with mutable checkpoints

Theoretical Computer Science - Dependable computing
Bounded time-stamping in message-passing systems

Theoretical Computer Science
Local stabilizer

Journal of Parallel and Distributed Computing - Self-stabilizing distributed systems
Interval consistency of asynchronous distributed computations

Journal of Computer and System Sciences
Perfect Failure Detection in Timed Asynchronous Systems

IEEE Transactions on Computers
An Experimental Evaluation of Coordinated Checkpointing in a Parallel Machine

EDCC-3 Proceedings of the Third European Dependable Computing Conference on Dependable Computing
Detection of Orthogonal Interval Relations

HiPC '02 Proceedings of the 9th International Conference on High Performance Computing
Performance Evaluation of Fault Tolerance for Parallel Applications in Networked Environments

ICPP '97 Proceedings of the international Conference on Parallel Processing
CoCheck: Checkpointing and Process Migration for MPI

IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Interactive Visual Exploration of Distributed Computations

IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Detecting Temporal Logic Predicates on the Happened-Before Model

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Causality Filters: A Tool for the Online Visualization and Steering of Parallel and Distributed Programs

IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Efficient Garbage Collection Schemes for Causal Message Logging with Independent Checkpointing in Message Passing Systems

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Checkpointing and Rollback of Wide-area Distributed Applications using Mobile Agents

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
FANTOMAS: Fault Tolerance for Mobile Agents in Clusters

IPDPS '00 Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing
QoS based Checkpoint Protocol in Multimedia Network Systems

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Concurrent Reading and Writing with Mobile Agents

IWDC '02 Proceedings of the 4th International Workshop on Distributed Computing, Mobile and Wireless Computing
Two Epoch Algorithms for Disaster Recovery

VLDB '90 Proceedings of the 16th International Conference on Very Large Data Bases
On the Complexity of the Minimum and Maximum Global Snapshot Problems

COMPSAC '97 Proceedings of the 21st International Computer Software and Applications Conference
Computation Slicing: Techniques and Theory

DISC '01 Proceedings of the 15th International Conference on Distributed Computing
Guaranteed Mutually Consistent Checkpointing in Distributed Computations

ASIAN '98 Proceedings of the 4th Asian Computing Science Conference on Advances in Computing Science
Distributed Checkpointing on Clusters with Dynamic Striping and Staggering

ASIAN '02 Proceedings of the 7th Asian Computing Science Conference on Advances in Computing Science: Internet Computing and Modeling, Grid Computing, Peer-to-Peer Computing, and Cluster
Shortcut Replay: A Replay Technique for Debugging Long-Running Parallel Programs

ASIAN '02 Proceedings of the 7th Asian Computing Science Conference on Advances in Computing Science: Internet Computing and Modeling, Grid Computing, Peer-to-Peer Computing, and Cluster
An Efficient Coordinated Checkpointing Scheme Based on PWD Model

ICOIN '02 Revised Papers from the International Conference on Information Networking, Wireless Communications Technologies and Network Applications-Part II
A Hybrid Fault-Tolerant Scheme Based on Checkpointing in MASs

ICOIN '02 Revised Papers from the International Conference on Information Networking, Wireless Communications Technologies and Network Applications-Part II
A Structural Embedding of Ocsid in PVS

TPHOLs '01 Proceedings of the 14th International Conference on Theorem Proving in Higher Order Logics
Instant Image: Transitive and Cyclical Snapshots in Distributed Storage Volumes

Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Universal Constructs in Distributed Computations

Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
Agents, Distributed Algorithms, and Stabilization

COCOON '00 Proceedings of the 6th Annual International Conference on Computing and Combinatorics
Distributed Configuration as Distributed Dynamic Constraint Satisfaction

Proceedings of the 14th International conference on Industrial and engineering applications of artificial intelligence and expert systems: engineering of intelligent systems
Checkpoint-Recovery for Mobile Intelligent Networks

Proceedings of the 14th International conference on Industrial and engineering applications of artificial intelligence and expert systems: engineering of intelligent systems
Keeping Track of the Latest Gossip in Shared Memory Systems

FST TCS 2000 Proceedings of the 20th Conference on Foundations of Software Technology and Theoretical Computer Science
Concurrent Knowledge and Logical Clock Abstractions

FST TCS 2000 Proceedings of the 20th Conference on Foundations of Software Technology and Theoretical Computer Science
Distributed Reinforcement of Arc-Consistency

PRICAI '02 Proceedings of the 7th Pacific Rim International Conference on Artificial Intelligence: Trends in Artificial Intelligence
Design Evolution of the EROS Single-Level Store

ATEC '02 Proceedings of the General Track of the annual conference on USENIX Annual Technical Conference
Algorithm Visualization For Distributed Environments

INFOVIS '98 Proceedings of the 1998 IEEE Symposium on Information Visualization
(Im)Possibilities of Predicate Detection in Crash-Affected Systems

WSS '01 Proceedings of the 5th International Workshop on Self-Stabilizing Systems
Recent Advances in Distributed Garbage Collection

Advances in Distributed Systems, Advanced Distributed Computing: From Algorithms to Systems
Termination Detection of Distributed Algorithms by Graph Relabelling Systems

ICGT '02 Proceedings of the First International Conference on Graph Transformation
Mechanizing Proofs of Computation Equivalence

CAV '99 Proceedings of the 11th International Conference on Computer Aided Verification
A Fault-Tolerant Scheme of Multi-agent System for Worker Agents

AMT '01 Proceedings of the 6th International Computer Science Conference on Active Media Technology
Synergistic Coordination between Software and Hardware Fault Tolerance Techniques

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Extending PVM with Consistent Cut Capabilities: Application Aspects and Implementation Strategies

Proceedings of the 6th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
QoS-Based Checkpoint Protocol for Multimedia Network Systems

PCM '01 Proceedings of the Second IEEE Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing
Protocol for Taking Object-Based Checkpoints

DEXA '00 Proceedings of the 11th International Conference on Database and Expert Systems Applications
Deadlock detection in distributed database systems: a new algorithm and a comparative performance analysis

The VLDB Journal — The International Journal on Very Large Data Bases
An Efficient Optimistic Message Logging Scheme for Recoverable Mobile Computing Systems

IEEE Transactions on Mobile Computing
An efficient causal logging scheme for recoverable distributed shared memory systems

Parallel Computing
Automated application-level checkpointing of MPI programs

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Single stepping in event-visualization tools

CASCON '96 Proceedings of the 1996 conference of the Centre for Advanced Studies on Collaborative research
Collective operations in application-level fault-tolerant MPI

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Debugging in a Distributed World: Observation and Control

ASSET '98 Proceedings of the 1998 IEEE Workshop on Application - Specific Software Engineering and Technology
A Fair Fast Distributed Concurrent-Reader Exclusive-Writer Synchronization

FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Supporting fault-tolerance in heterogeneous distributed applications

HCW '97 Proceedings of the 6th Heterogeneous Computing Workshop (HCW '97)
A world-wide distributed system using Java and the Internet

HPDC '96 Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing
Concurrent rollback for crash recovery in extended hypercube networks

PAS '95 Proceedings of the First Aizu International Symposium on Parallel Algorithms/Architecture Synthesis
Minimizing timestamp size for completely asynchronous optimistic recovery with minimal rollback

SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems
Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques

SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems
An Efficient Checkpointing Algorithm for Distributed Systems Implementing Reliable Communication Channels

SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Optimistic Recovery in Multi-Threaded Distributed Systems

SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Object-Based Checkpoints in Distributed Systems

WORDS '97 Proceedings of the 3rd Workshop on Object-Oriented Real-Time Dependable Systems - (WORDS '97)
Checkpoint and Rollback in Asynchronous Distributed Systems

INFOCOM '97 Proceedings of the INFOCOM '97. Sixteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Driving the Information Revolution
User-Triggered Checkpointing: System-Independent and Scalable Application Recovery

ISCC '97 Proceedings of the 2nd IEEE Symposium on Computers and Communications (ISCC '97)
Termination detection in data-driven parallel computations/applications

Journal of Parallel and Distributed Computing
Evaluating Distributed Checkpointing Protocol

ICDCS '03 Proceedings of the 23rd International Conference on Distributed Computing Systems
Enabling Snap-Stabilization

ICDCS '03 Proceedings of the 23rd International Conference on Distributed Computing Systems
User-level checkpointing through exportable kernel state

IWOOOS '96 Proceedings of the 5th International Workshop on Object Orientation in Operating Systems (IWOOOS '96)
An Exercise in Formal Reasoning about Mobile Communications

IWSSD '98 Proceedings of the 9th international workshop on Software specification and design
A Mechanized Proof Environment for the Convenient Computations Proof Method

Formal Methods in System Design
Error detection in large-scale parallel programs with long runtimes

Future Generation Computer Systems - Tools for program development and analysis
Algorithm-Based Diskless Checkpointing for Fault-Tolerant Matrix Operations

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Completely Asynchronous Optimistic Recovery with Minimal Rollbacks

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Checkpointing and Its Applications

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Fault Tolerance for Off-the-Shelf Applications and Hardware

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
On Detecting Global Predicates in Distributed Computations

ICDCS '01 Proceedings of the 21st International Conference on Distributed Computing Systems
Self-Stabilizing PIF Algorithm in Arbitrary Rooted Networks

ICDCS '01 Proceedings of the 21st International Conference on Distributed Computing Systems
Design and Implementation of a Composable Reflective Middleware Framework

ICDCS '01 Proceedings of the 21st International Conference on Distributed Computing Systems
A Protocol Design of Communication State Transfer for Distributed Computing

ICDCS '01 Proceedings of the 21st International Conference on Distributed Computing Systems
On Slicing a Distributed Computation

ICDCS '01 Proceedings of the 21st International Conference on Distributed Computing Systems
Enforcing Perfect Failure Detection

ICDCS '01 Proceedings of the 21st International Conference on Distributed Computing Systems
Predicate Control for Active Debugging of Distributed Programs

IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
An algorithm for Supporting Fault Tolerant Objects in Distributed Object-Oriented Operating Systems

IWOOOS '95 Proceedings of the 4th International Workshop on Object-Orientation in Operating Systems
Checkpointing and Recovery for Distributed Shared Memory Applications

IWOOOS '95 Proceedings of the 4th International Workshop on Object-Orientation in Operating Systems
A Fine-Grained Modality Classification for Global Predicates

IEEE Transactions on Parallel and Distributed Systems
On Properties of RDT Communication-Induced Checkpointing Protocols

IEEE Transactions on Parallel and Distributed Systems
ACM SIGACT News distributed computing column 12

ACM SIGACT News
Granularity-Driven Dynamic Predicate Slicing Algorithms for Message Passing Systems

Automated Software Engineering
Distributed recovery with K-optimistic logging

Journal of Parallel and Distributed Computing
Causality tracking in causal message-logging protocols

Distributed Computing
Action systems in incremental and aspect-oriented modeling

Distributed Computing - Papers in celebration of the 20th anniversary of PODC
On designing direct dependency-based fast recovery algorithms for distributed systems

ACM SIGOPS Operating Systems Review
Finding a Recovery Line in Uncoordinated Checkpointing

ICDCSW '04 Proceedings of the 24th International Conference on Distributed Computing Systems Workshops - W7: EC (ICDCSW'04) - Volume 7
Predicate control: synchronization in distributed computations with look-ahead

Journal of Parallel and Distributed Computing
Energy-aware deterministic fault tolerance in distributed real-time embedded systems

Proceedings of the 41st annual Design Automation Conference
A Global-State-Triggered Fault Injector for Distributed System Evaluation

IEEE Transactions on Parallel and Distributed Systems
Quantifying rollback propagation in distributed checkpointing

Journal of Parallel and Distributed Computing
A causal message logging protocol for mobile nodes in mobile computing systems

Future Generation Computer Systems - Special issue: Advanced services for clusters and internet computing
Communication State Transfer for the Mobility of Concurrent Heterogeneous Computing

IEEE Transactions on Computers
Fast, Centralized Detection and Resolution of Distributed Deadlocks in the Generalized Model

IEEE Transactions on Software Engineering
Agent-Based Approach to Dynamic Meeting Scheduling Problems

AAMAS '04 Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 3
Concurrent checkpoint initiation and recovery algorithms on asynchronous ring networks

Journal of Parallel and Distributed Computing
Application-level checkpointing for shared memory programs

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
A link between knowledge and communication in faulty distributed systems

TARK '90 Proceedings of the 3rd conference on Theoretical aspects of reasoning about knowledge
A knowledge theoretic account of recovery in distributed systems: the case of negotiated commitment

TARK '88 Proceedings of the 2nd conference on Theoretical aspects of reasoning about knowledge
Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery

IEEE Transactions on Dependable and Secure Computing
PDB: Pervasive Debugging With Xen

GRID '04 Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing
Checkpoint and Restart for Distributed Components in XCAT3

GRID '04 Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing
A Termination Detection Protocol for Use in Mobile Ad Hoc Networks

Automated Software Engineering
Communication-based prevention of useless checkpoints in distributed computations

Distributed Computing
Constraint-based structuring of network protocols

Distributed Computing
Detection of global predicates: techniques and their limitations

Distributed Computing
Extensible, Scalable Monitoring for Clusters of Computers

LISA '97 Proceedings of the 11th USENIX conference on System administration
Finding missing synchronization in a distributed computation using controlled re-execution

Distributed Computing
The power of logical clock abstractions

Distributed Computing
Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
A Study of Various Load Information Exchange Mechanisms for a Distributed Application using Dynamic Scheduling

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Event Logging: Portable and Efficient Checkpointing in Heterogeneous Environments with Non-FIFO Communication Platforms

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 1 - Volume 02
Optimizing Checkpoint Sizes in the C3 System

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 10 - Volume 11
A novel min-process checkpointing scheme for mobile computing systems

Journal of Systems Architecture: the EUROMICRO Journal
Safety assurance via on-line monitoring

Distributed Computing
Synchronous, asynchronous, and causally ordered communication

Distributed Computing
Efficient detection of a class of stable properties

Distributed Computing
Strong stable properties in distributed systems

Distributed Computing
Efficient algorithms for optimistic crash recovery

Distributed Computing
Concurrent common knowledge: defining agreement for asynchronous systems

Distributed Computing
Verification of distributed programs using representative interleaving sequences

Distributed Computing
µsik: A Micro-Kernel for Parallel/Distributed Simulation Systems

Proceedings of the 19th Workshop on Principles of Advanced and Distributed Simulation
Self-stabilizing extensions for message-passing systems

Distributed Computing - Special issue: Self-stabilization
The inhibition spectrum and the achievement of causal consistency

Distributed Computing
Towards the construction of distributed detection programs, with an application to distributed termination

Distributed Computing
Detecting causal relationships in distributed computations: in search of the holy grail

Distributed Computing
Intractability results in predicate detection

Information Processing Letters
On deadlocks of exclusive AND-requests for resources

Distributed Computing
Fault tolerance for internet agent systems: in cases of stop failure and byzantine failure

Proceedings of the fourth international joint conference on Autonomous agents and multiagent systems
On the design of a pervasive debugger

Proceedings of the sixth international symposium on Automated analysis-driven debugging
A channel memory based fault tolerance for MPI applications

Future Generation Computer Systems - Special issue: Parallel computing technologies
Causality-Based Predicate Detection across Space and Time

IEEE Transactions on Computers
Using Consistent Global Checkpoints to Synchronize Processes in Distributed Simulation

DS-RT '05 Proceedings of the 9th IEEE International Symposium on Distributed Simulation and Real-Time Applications
Event-based Programming Models for

DS-RT '05 Proceedings of the 9th IEEE International Symposium on Distributed Simulation and Real-Time Applications
An Efficient Index-Based Checkpointing Protocol with Constant-Size Control Information on Messages

IEEE Transactions on Dependable and Secure Computing
A visual environment for distributed simulation systems

ACM SIGSIM Simulation Digest
Asynchronous backtracking without adding links: a new member in the ABT family

Artificial Intelligence - Special issue: Distributed constraint satisfaction
Asynchronous aggregation and consistency in distributed constraint satisfaction

Artificial Intelligence - Special issue: Distributed constraint satisfaction
Meetings scheduling solver enhancement with local consistency reinforcement

Applied Intelligence
Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++

ACM SIGOPS Operating Systems Review
Performance analysis of different checkpointing and recovery schemes using stochastic model

Journal of Parallel and Distributed Computing
Resettable vector clocks

Journal of Parallel and Distributed Computing
Finding a suitable checkpoint and recovery protocol for a distributed application

Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
Fast batched data transfer with flush channels: A performance analysis

Journal of Parallel and Distributed Computing
Techniques and applications of computation slicing

Distributed Computing
Manufacturing opaque predicates in distributed systems for code obfuscation

ACSC '06 Proceedings of the 29th Australasian Computer Science Conference - Volume 48
Design, Analysis and Performance Evaluation of a New Algorithm for Developing a Fault Tolerant Distributed System

ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
Cyclic Storage for Fault-Tolerant Distributed Executions

IEEE Transactions on Parallel and Distributed Systems
Detecting and Isolating Malicious Routers

IEEE Transactions on Dependable and Secure Computing
Safety and consistency in policy-based authorization systems

Proceedings of the 13th ACM conference on Computer and communications security
Experimental evaluation of application-level checkpointing for OpenMP programs

Proceedings of the 20th annual international conference on Supercomputing
Scalable algorithms for global snapshots in distributed systems

Proceedings of the 20th annual international conference on Supercomputing
Realizing the e-science desktop peer using a peer-to-peer distributed virtual machine middleware

Proceedings of the 4th international workshop on Middleware for grid computing
Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Using queries for distributed monitoring and forensics

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Declarative failure recovery for sensor networks

Proceedings of the 6th international conference on Aspect-oriented software development
Quasi-atomic recovery for distributed agents

Parallel Computing
An efficient reliable broadcasting protocol for wireless mobile ad hoc networks

Ad Hoc Networks
Efficient detection of a locally stable predicate in a distributed system

Journal of Parallel and Distributed Computing
MPI implementation of parallel subdomain methods for linear and nonlinear convection-diffusion problems

Journal of Parallel and Distributed Computing
Peer-to-Peer and fault-tolerance: Towards deployment-based technical services

Future Generation Computer Systems
Exploring failure transparency and the limits of generic recovery

OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Formal Verification of Simulation Traces Using Computation Slicing

IEEE Transactions on Computers
On the Complexity of Removing Z-Cycles from a Checkpoints and Communication Pattern

IEEE Transactions on Computers
Detecting Arbitrary Stable Properties Using Efficient Snapshots

IEEE Transactions on Software Engineering
Self-stabilizing algorithm for checkpointing in a distributed system

Journal of Parallel and Distributed Computing
Lightweight consistency enforcement schemes for distributed proofs with hidden subtrees

Proceedings of the 12th ACM symposium on Access control models and technologies
Object caching in a CORBA compliant system

COOTS'96 Proceedings of the 2nd conference on USENIX Conference on Object-Oriented Technologies (COOTS) - Volume 2
Transparent fault tolerance for parallel applications on networks of workstations

ATEC '96 Proceedings of the 1996 annual conference on USENIX Annual Technical Conference
Testing Dynamic Adaptation in Distributed Systems

AST '07 Proceedings of the Second International Workshop on Automation of Software Test
An agent-based approach to solve dynamic meeting scheduling problems with preferences

Engineering Applications of Artificial Intelligence
An efficient delay-optimal distributed termination detection algorithm

Journal of Parallel and Distributed Computing
Modeling and design of fault-tolerant and self-adaptive reconfigurable networked embedded systems

EURASIP Journal on Embedded Systems
An enhanced model-based checkpointing protocol

PDCN'07 Proceedings of the 25th IASTED International Multi-Conference: parallel and distributed computing and networks
Transactions with isolation and cooperation

Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications
Temporal Predicate Detection Using Synchronized Clocks

IEEE Transactions on Computers
Solving Computation Slicing Using Predicate Detection

IEEE Transactions on Parallel and Distributed Systems
Towards distributed service provisioning

Proceedings of the 6th international conference on Mobile and ubiquitous multimedia
A Lightweight Heuristic-based Mechanism for Collecting Committed Consistent Global States in Optimistic Simulation

DS-RT '07 Proceedings of the 11th IEEE International Symposium on Distributed Simulation and Real-Time Applications
Distributed Watchpoints: Debugging Large Modular Robot Systems

International Journal of Robotics Research
Model-based performance evaluation of distributed checkpointing protocols

Performance Evaluation
A synchronous checkpointing protocol for mobile distributed systems: probabilistic approach

International Journal of Information and Computer Security
Coordinated checkpoint versus message log for fault tolerant MPI

International Journal of High Performance Computing and Networking
Data sharing vs. message passing: synergy or incompatibility?: an implementation-driven case study

Proceedings of the 2008 ACM symposium on Applied computing
Transparent checkpoint-restart of multiple processes on commodity operating systems

ATC'07 Proceedings of the 2007 USENIX Annual Technical Conference
Time in State Machines

Fundamenta Informaticae - Special issue on ASM'05
On termination detection in crash-prone distributed systems with failure detectors

Journal of Parallel and Distributed Computing
Data-stream-based global event monitoring using pairwise interactions

Journal of Parallel and Distributed Computing
Tracking in a spaghetti bowl: monitoring transactions using footprints

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A low-cost hybrid coordinated checkpointing protocol for mobile distributed systems

Mobile Information Systems
Communication analysis of distributed programs

Scientific Programming - Parallel/High-Performance Object-Oriented Scientific Computing (POOSC '05), Glasgow, UK, 25 July 2005
A quasi-synchronous checkpointing algorithm that prevents contention for stable storage

Information Sciences: an International Journal
A new class of nature-inspired algorithms for self-adaptive peer-to-peer computing

ACM Transactions on Autonomous and Adaptive Systems (TAAS)
Applying static network protocols to dynamic networks

SFCS '87 Proceedings of the 28th Annual Symposium on Foundations of Computer Science
Consensus routing: the internet as a distributed system

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Optimal maintenance of a spanning tree

Journal of the ACM (JACM)
2-step algorithm for enhancing effectiveness of sender-based message logging

SpringSim '07 Proceedings of the 2007 spring simulation multiconference - Volume 2
Taking snapshots of virtual networked environments

VTDC '07 Proceedings of the 2nd international workshop on Virtualization technology in distributed computing
Testing Distributed Systems Through Symbolic Model Checking

FORTE '07 Proceedings of the 27th IFIP WG 6.1 international conference on Formal Techniques for Networked and Distributed Systems
ModHel'X: A Component-Oriented Approach to Multi-Formalism Modeling

Models in Software Engineering
Distributed Semantics and Implementation for Systems with Interaction and Priority

FORTE '08 Proceedings of the 28th IFIP WG 6.1 international conference on Formal Techniques for Networked and Distributed Systems
Enforcing Safety and Consistency Constraints in Policy-Based Authorization Systems

ACM Transactions on Information and System Security (TISSEC)
Lightweight log management algorithm for removing logged messages of sender processes with little overhead

WSEAS Transactions on Computers
An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems

Journal of Parallel and Distributed Computing
Empire of colonies: Self-stabilizing and self-organizing distributed algorithm

Theoretical Computer Science
Sensornet Checkpointing: Enabling Repeatability in Testbeds and Realism in Simulations

EWSN '09 Proceedings of the 6th European Conference on Wireless Sensor Networks
Transparent checkpoints of closed distributed systems in Emulab

Proceedings of the 4th ACM European conference on Computer systems
Distributed constraint satisfaction with partially known constraints

Constraints
A self-protecting and self-healing framework for negotiating services and trust in autonomic communication systems

Computer Networks: The International Journal of Computer and Telecommunications Networking
Interconnect agnostic checkpoint/restart in open MPI

Proceedings of the 18th ACM international symposium on High performance distributed computing
CrystalBall: predicting and preventing inconsistencies in deployed distributed systems

NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
NetReview: detecting when interdomain routing goes wrong

NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Distributed Log-based Reconciliation

Proceedings of the 2006 conference on ECAI 2006: 17th European Conference on Artificial Intelligence, August 29 - September 1, 2006, Riva del Garda, Italy
Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities

International Journal of High Performance Computing Applications
Transparent parallel checkpointing and migration in clusters and ClusterGrids

International Journal of Computational Science and Engineering
A Snapshot Algorithm for Mobile Ad Hoc Networks

IWANN '09 Proceedings of the 10th International Work-Conference on Artificial Neural Networks: Part II: Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living
Measurement and modeling of a large-scale overlay for multimedia streaming

The Fourth International Conference on Heterogeneous Networking for Quality, Reliability, Security and Robustness & Workshops
Brief announcement: virtual world consistency: a new condition for STM systems

Proceedings of the 28th ACM symposium on Principles of distributed computing
A novel low-overhead recovery approach for distributed systems

Journal of Computer Systems, Networks, and Communications
Demo abstract: Sensornet checkpointing between simulated and deployed networks

IPSN '09 Proceedings of the 2009 International Conference on Information Processing in Sensor Networks
Locally Distributed Predicates: A Programming Facility for Distributed State Detection

ICLP '09 Proceedings of the 25th International Conference on Logic Programming
Efficient model checking for LTL with partial order snapshots

Theoretical Computer Science
Toward Exascale Resilience

International Journal of High Performance Computing Applications
An autonomous agent approach to query optimization in stream grids

Proceedings of the International Conference on Management of Emergent Digital EcoSystems
Macrodebugging: global views of distributed program execution

Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems
Scalable temporal order analysis for large scale debugging

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Application and middleware transparent checkpointing with TCKPT on ClusterGrids

Future Generation Computer Systems
SBDO: A New Robust Approach to Dynamic Distributed Constraint Optimisation

PRIMA '09 Proceedings of the 12th International Conference on Principles of Practice in Multi-Agent Systems
Predicting and preventing inconsistencies in deployed distributed systems

ACM Transactions on Computer Systems (TOCS)
A tale of two planners: modular robotic planning with LDP

IROS'09 Proceedings of the 2009 IEEE/RSJ international conference on Intelligent robots and systems
A weighted checkpointing protocol for mobile distributed systems

International Journal of Ad Hoc and Ubiquitous Computing
'Conceptual distance' and interface-supported visualization of information objects and patterns

Journal of Visual Languages and Computing
Recovery oriented programming

SSS'06 Proceedings of the 8th international conference on Stabilization, safety, and security of distributed systems
An automata-based approach to property testing in event traces

TestCom'03 Proceedings of the 15th IFIP international conference on Testing of communicating systems
ROS: the rollback-one-step method to minimize the waiting time during debugging long-running parallel programs

VECPAR'02 Proceedings of the 5th international conference on High performance computing for computational science
Improving dependability of component-based systems via multi-versioning connectors

Architecting dependable systems
Parametric and sliced causality

CAV'07 Proceedings of the 19th international conference on Computer aided verification
Distributed forward checking may lie for privacy

CSCLP'06 Proceedings of the constraint solving and constraint logic programming 11th annual ERCIM international conference on Recent advances in constraints
Distance sensitive snapshots in wireless sensor networks

OPODIS'07 Proceedings of the 11th international conference on Principles of distributed systems
Asynchronous inter-level forward-checking for DisCSPs

CP'09 Proceedings of the 15th international conference on Principles and practice of constraint programming
Help when needed, but no more: efficient read/write partial snapshot

DISC'09 Proceedings of the 23rd international conference on Distributed computing
Co-ordination in artificial agent societies: social structures and its implications for autonomous problem-solving agents

Co-ordination in artificial agent societies: social structures and its implications for autonomous problem-solving agents
A general method to make multi-clock system deterministic

Proceedings of the Conference on Design, Automation and Test in Europe
A flexible checkpoint/restart model in distributed systems

PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Communicating transactions

CONCUR'10 Proceedings of the 21st international conference on Concurrency theory
Designing execution control in programs with global application states monitoring

PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part II
Checkpoint/restart-enabled parallel debugging

EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
Recent advances in checkpoint/recovery systems

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Plan switching: an approach to plan execution in changing environments

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Piccolo: building fast, distributed programs with partitioned tables

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Modeling and analyzing periodic distributed computations

SSS'10 Proceedings of the 12th international conference on Stabilization, safety, and security of distributed systems
Safe flocking in spite of actuator faults

SSS'10 Proceedings of the 12th international conference on Stabilization, safety, and security of distributed systems
Aspect-oriented checkpointing approach of composed web services

ICWE'10 Proceedings of the 10th international conference on Current trends in web engineering
Self-stabilizing Byzantine asynchronous unison

OPODIS'10 Proceedings of the 14th international conference on Principles of distributed systems
Peers-for-peers (P4P): an efficient and reliable fault-tolerance strategy for cycle-stealing P2P applications

International Journal of Communication Networks and Distributed Systems
Reliable distributed data stream management in mobile environments

Information Systems
Collective assertions

VMCAI'11 Proceedings of the 12th international conference on Verification, model checking, and abstract interpretation
Revisiting and improving a result on integrity preservation by concurrent transactions

OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems
Macro and micro context-awareness for autonomic pervasive computing

Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services
Detecting Locally Distributed Predicates

ACM Transactions on Autonomous and Adaptive Systems (TAAS)
A hybrid fault tolerance technique in grid computing system

The Journal of Supercomputing
Hybrid checkpointing using emerging nonvolatile memories for future exascale systems

ACM Transactions on Architecture and Code Optimization (TACO)
New & efficient low overheads algorithm for mobile distributed systems

Proceedings of the International Conference & Workshop on Emerging Trends in Technology
Concurrency among strangers: programming in E as plan coordination

TGC'05 Proceedings of the 1st international conference on Trustworthy global computing
Fast checkpoint recovery algorithms for frequently consistent applications

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Trebuchet: exploring TLP with dataflow virtualisation

International Journal of High Performance Systems Architecture
Toward generating reducible replay logs

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
A decentralized deadlock detection and resolution algorithm for generalized model in distributed systems

Distributed and Parallel Databases
Boosting distributed constraint satisfaction

Journal of Heuristics
ScatterD: Spatial deployment optimization with hybrid heuristic/evolutionary algorithms

ACM Transactions on Autonomous and Adaptive Systems (TAAS)
Monitoring distributed systems using knowledge

FMOODS'11/FORTE'11 Proceedings of the joint 13th IFIP WG 6.1 and 30th IFIP WG 6.1 international conference on Formal techniques for distributed systems
Correlated set coordination in fault tolerant message logging protocols

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Distributed constraint programming with agents

ICAIS'11 Proceedings of the Second international conference on Adaptive and intelligent systems
Brief announcement: a concurrent partial snapshot algorithm for large-scale and dynamic distributed systems

SSS'11 Proceedings of the 13th international conference on Stabilization, safety, and security of distributed systems
Help when needed, but no more: Efficient read/write partial snapshot

Journal of Parallel and Distributed Computing
A global snapshot collection algorithm with concurrent initiators with non-FIFO channel

ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part I
Distributed implementation of systems with multiparty interactions and priorities

SEFM'11 Proceedings of the 9th international conference on Software engineering and formal methods
Parallel solution of the obstacle problem in Grid environments

International Journal of High Performance Computing Applications
A proxy based efficient checkpointing scheme for fault recovery in mobile grid system

HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Empire of Colonies: self-stabilizing and self-organizing distributed algorithms

OPODIS'06 Proceedings of the 10th international conference on Principles of Distributed Systems
Dynamic virtual clustering with Xen and Moab

ISPA'06 Proceedings of the 2006 international conference on Frontiers of High Performance Computing and Networking
On distributed verification

ICDCN'06 Proceedings of the 8th international conference on Distributed Computing and Networking
Checkpointing and communication pattern-neutral algorithm for removing messages logged by senders

HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications
Computational efficiency and practical implications for a client grid

HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications
An asynchronous recovery algorithm based on a staggered quasi-synchronous checkpointing algorithm

IWDC'05 Proceedings of the 7th international conference on Distributed Computing
Self-stabilizing checkpointing algorithm in ring topology

IWDC'05 Proceedings of the 7th international conference on Distributed Computing
Self-refined fault tolerance in HPC using dynamic dependent process groups

IWDC'05 Proceedings of the 7th international conference on Distributed Computing
Requirements for secure logging of decentralized cross-organizational workflow executions

OTM'05 Proceedings of the 2005 OTM Confederated international conference on On the Move to Meaningful Internet Systems
Self-stabilization of Byzantine protocols

SSS'05 Proceedings of the 7th international conference on Self-Stabilizing Systems
The generalized deadlock resolution problem

ICALP'05 Proceedings of the 32nd international conference on Automata, Languages and Programming
Immediate detection of predicates in pervasive environments

Journal of Parallel and Distributed Computing
Predicate detection using event streams in ubiquitous environments

EUC'05 Proceedings of the 2005 international conference on Embedded and Ubiquitous Computing
Global state detection based on peer-to-peer interactions

EUC'05 Proceedings of the 2005 international conference on Embedded and Ubiquitous Computing
Nonintrusive snapshots using thin slices

EUC'05 Proceedings of the 2005 international conference on Embedded and Ubiquitous Computing
A fault-tolerant multi-agent development framework

ISPA'04 Proceedings of the Second international conference on Parallel and Distributed Processing and Applications
Monitoring stable properties in dynamic peer-to-peer distributed systems

FSTTCS '05 Proceedings of the 25th international conference on Foundations of Software Technology and Theoretical Computer Science
A checkpoint/recovery model for heterogeneous dataflow computations using work-stealing

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Snapshot verification

TACAS'05 Proceedings of the 11th international conference on Tools and Algorithms for the Construction and Analysis of Systems
Implementing rollback-recovery coordinated checkpoints

ISSADS'05 Proceedings of the 5th international conference on Advanced Distributed Systems
Performance evaluation of consistent recovery protocols using MPICH-GF

EDCC'05 Proceedings of the 5th European conference on Dependable Computing
Transparent fault tolerance for grid applications

EGC'05 Proceedings of the 2005 European conference on Advances in Grid Computing
Solving collaborative fuzzy agents problems with CLP(FD)

PADL'05 Proceedings of the 7th international conference on Practical Aspects of Declarative Languages
Efficient reduction for wait-free termination detection in a crash-prone distributed system

DISC'05 Proceedings of the 19th international conference on Distributed Computing
Plausible clocks with bounded inaccuracy

DISC'05 Proceedings of the 19th international conference on Distributed Computing
A model for detecting "global footprint anomalies" in a grid environment

PAISI'10 Proceedings of the 2010 Pacific Asia conference on Intelligence and Security Informatics
Stable predicate detection in dynamic systems

OPODIS'05 Proceedings of the 9th international conference on Principles of Distributed Systems
Monitoring distributed controllers: when an efficient LTL algorithm on sequences is needed to model-check traces

FM'06 Proceedings of the 14th international conference on Formal Methods
Rigorous fault tolerance using aspects and formal methods

Rigorous Development of Complex Fault-Tolerant Systems
Distributed garbage collection for mobile actor systems: the pseudo root approach

GPC'06 Proceedings of the First international conference on Advances in Grid and Pervasive Computing
MadLINQ: large-scale distributed matrix computation for the cloud

Proceedings of the 7th ACM european conference on Computer Systems
A versatile STM protocol with invisible read operations that satisfies the virtual world consistency condition

SIROCCO'09 Proceedings of the 16th international conference on Structural Information and Communication Complexity
Monitoring for hierarchical web services compositions

TES'05 Proceedings of the 6th international conference on Technologies for E-Services
Analysis of interval-based global state detection

ICDCIT'05 Proceedings of the Second international conference on Distributed Computing and Internet Technology
Efficient model checking for LTL with partial order snapshots

TACAS'06 Proceedings of the 12th international conference on Tools and Algorithms for the Construction and Analysis of Systems
Automated systematic testing of open distributed programs

FASE'06 Proceedings of the 9th international conference on Fundamental Approaches to Software Engineering
Distributed GraphLab: a framework for machine learning and data mining in the cloud

Proceedings of the VLDB Endowment
Research note: Self-stabilizing Byzantine asynchronous unison

Journal of Parallel and Distributed Computing
On time complexity of distributed algorithms for generalized deadlock detection

ADBIS'97 Proceedings of the First East-European conference on Advances in Databases and Information systems
Impact of over-decomposition on coordinated checkpoint/rollback protocol

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Research: Debugging tool for distributed Estelle programs

Computer Communications
Research: Modified distributed snapshots algorithm for protocol stabilization

Computer Communications
Optimal checkpointing interval of a communication system with rollback recovery

Mathematical and Computer Modelling: An International Journal
Memory management for many-core processors with software configurable locality policies

Proceedings of the 2012 international symposium on Memory Management
Virtual world consistency: A condition for STM systems (with a versatile protocol with invisible read operations)

Theoretical Computer Science
SeWDReSS: on the design of an application independent, secure, wide-area disaster recovery storage system

Multimedia Tools and Applications
Ensuring reliability in B2B services: Fault tolerant inter-organizational workflows

Information Systems Frontiers
Composable reliability for asynchronous systems

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Multi-agent A* for parallel and distributed systems

Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 3
Conservative synchronization methods for parallel DEVS and Cell-DEVS

Proceedings of the 2011 Summer Computer Simulation Conference
Time in State Machines

Fundamenta Informaticae - Special Issue on ASM'05
On snapshots and stable properties detection in anonymous fully distributed systems (extended abstract)

SIROCCO'12 Proceedings of the 19th international conference on Structural Information and Communication Complexity
Adding Partial Orders to Linear Temporal Logic

Fundamenta Informaticae
Alleviating scalability issues of checkpointing protocols

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Looking for a definition of dynamic distributed systems

PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies
Fault tolerance: case study

Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology
Detecting temporal logic predicates on distributed computations

DISC'07 Proceedings of the 21st international conference on Distributed Computing
Brief announcement: fast travellers: infrastructure-independent deadlock resolution in resource-restricted distributed systems

DISC'12 Proceedings of the 26th international conference on Distributed Computing
Specification and model checking of the Chandy and Lamport distributed snapshot algorithm in rewriting logic

ICFEM'12 Proceedings of the 14th international conference on Formal Engineering Methods: formal methods and software engineering
The viability of using compression to decrease message log sizes

Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
Nogood-based asynchronous forward checking algorithms

Constraints
Efficient distributed snapshots in an anonymous asynchronous message-passing system

Journal of Parallel and Distributed Computing
Failure recovery: when the cure is worse than the disease

HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Leveraging SDN layering to systematically troubleshoot networks

Proceedings of the second ACM SIGCOMM workshop on Hot topics in software defined networking
A low complexity coordination architecture for networked supervisory medical systems

Proceedings of the ACM/IEEE 4th International Conference on Cyber-Physical Systems
Distributed wait state tracking for runtime MPI deadlock detection

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

The Journal of Supercomputing
Energy efficiency in high-performance computing with and without knowledge of applications and services

International Journal of High Performance Computing Applications
Post-failure recovery of MPI communication capability: Design and rationale

International Journal of High Performance Computing Applications
An optimal distributed trigger counting algorithm for large-scale networked systems

Simulation
Consistency without borders

Proceedings of the 4th annual Symposium on Cloud Computing
Towards privacy-preserving fault detection

Proceedings of the 9th Workshop on Hot Topics in Dependable Systems
HotSnap: a hot distributed snapshot system for virtual machine cluster

LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Specification and Verification of Concurrent Programs Through Refinements

Journal of Automated Reasoning
Compiler-Assisted Checkpointing of Parallel Codes: The Cetus and LLVM Experience

International Journal of Parallel Programming
Seeing through black boxes: Tracking transactions through queues under monitoring resource constraints

Performance Evaluation
Detecting stable locality-aware predicates

Journal of Parallel and Distributed Computing
Modeling, analyzing and slicing periodic distributed computations

Information and Computation
Libra: divide and conquer to verify forwarding tables in huge networks

NSDI'14 Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation

Quantified Score

Hi-index	0.06

Abstract

This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation. Many problems in distributed systems can be cast in terms of the problem of detecting global states. For instance, the global state detection algorithm helps to solve an important class of problems: stable property detection. A stable property is one that persists: once a stable property becomes true it remains true thereafter. Examples of stable properties are “computation has terminated,” “the system is deadlocked” and “all tokens in a token ring have disappeared.” The stable property detection problem is that of devising algorithms to detect a given stable property. Global state detection can also be used for checkpointing.
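
The abstract does not spell out the algorithm itself. As a rough illustration of what "determining a global state during a computation" can look like, the following is a minimal single-program simulation of a marker-based snapshot, in the spirit of the approach commonly associated with this paper. It is a sketch under stated assumptions, not the authors' formulation: channels are assumed reliable and FIFO, and the names System, Process, transfer, record_state, deliver, drain and snapshot_total, as well as the money-transfer workload, are hypothetical and chosen only for the example.

```python
from collections import deque

MARKER = object()  # sentinel, distinct from any application message (assumption)


class Process:
    def __init__(self, pid, balance):
        self.pid = pid
        self.balance = balance
        self.recorded_state = None  # local state captured for the snapshot
        self.recording = {}         # sender -> messages seen while recording that channel
        self.channel_state = {}     # sender -> final recorded contents of that channel


class System:
    """Hypothetical simulated system: one FIFO channel per ordered pair."""

    def __init__(self, balances):
        self.procs = {pid: Process(pid, b) for pid, b in balances.items()}
        self.channels = {(s, d): deque()
                         for s in balances for d in balances if s != d}

    def transfer(self, src, dst, amount):
        # Application message: money leaves src now and is in flight until delivered.
        self.procs[src].balance -= amount
        self.channels[(src, dst)].append(amount)

    def record_state(self, pid):
        # Record local state, start recording every incoming channel,
        # and send a marker on every outgoing channel.
        p = self.procs[pid]
        p.recorded_state = p.balance
        for (s, d), ch in self.channels.items():
            if d == pid:
                p.recording[s] = []
            elif s == pid:
                ch.append(MARKER)

    def deliver(self, src, dst):
        msg = self.channels[(src, dst)].popleft()
        p = self.procs[dst]
        if msg is MARKER:
            if p.recorded_state is None:
                self.record_state(dst)  # first marker seen: take the local snapshot
            # The marker closes the recording of this channel.
            p.channel_state[src] = p.recording.pop(src)
        else:
            p.balance += msg
            if src in p.recording:      # arrived after recording began, before the marker
                p.recording[src].append(msg)

    def drain(self):
        # Deliver pending messages in a fixed order until every channel is empty.
        progressed = True
        while progressed:
            progressed = False
            for (s, d), ch in self.channels.items():
                if ch:
                    self.deliver(s, d)
                    progressed = True

    def snapshot_total(self):
        total = 0
        for p in self.procs.values():
            total += p.recorded_state
            for msgs in p.channel_state.values():
                total += sum(msgs)
        return total


system = System({"A": 100, "B": 50})
system.transfer("A", "B", 10)
system.transfer("B", "A", 20)
system.record_state("A")       # A initiates the snapshot while money is in flight
system.transfer("B", "A", 5)   # B keeps computing before it sees the marker
system.drain()
print(system.snapshot_total())
```

In this run the recorded process states and channel contents account for all of the money in the system, even though no process ever paused. A stable property such as "computation has terminated" could then be evaluated on the recorded state rather than on the live system.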