The authors describe a tool called TAP, which is designed to help the programmer discover the causes of timing errors in running programs. TAP is similar to a postmortem debugger: it uses the history of interprocess communication to construct a timing graph, a directed graph in which an edge joins node x to node y if event x directly precedes event y in time. The programmer can then use TAP to inspect the graph and find the events that occurred in an unacceptable order. Because distributed programs are nondeterministic, the authors feel that a history-keeping mechanism must always be active so that bugs can be dealt with as they occur. The goal is to collect enough information at run time to construct the timing graph if needed. Since it is always active, this mechanism must be efficient. The authors also describe experiments run using TAP and report the impact that TAP's history-keeping mechanism has on the running time of various distributed programs.
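To make the timing-graph idea concrete, here is a minimal Python sketch. It is not TAP's implementation, which the abstract does not detail. It assumes a hypothetical history format in which each logged event records its process, a per-process sequence number, a kind ("send", "recv", or "local"), and a message id. The names `Event`, `build_timing_graph`, and `precedes` are illustrative only. An edge is added when one event directly precedes another: either the two are consecutive on the same process, or one is the send and the other the receive of the same message.

```python
from collections import defaultdict, namedtuple

# Hypothetical log record; the field names are assumptions, not TAP's format.
Event = namedtuple("Event", ["process", "seq", "kind", "msg_id"])


def build_timing_graph(history):
    """Return a dict mapping each event to the set of events it directly precedes."""
    graph = defaultdict(set)
    last_on_process = {}
    sends, receives = {}, {}

    for ev in sorted(history, key=lambda e: (e.process, e.seq)):
        graph[ev]  # touch the key so every event appears as a node
        prev = last_on_process.get(ev.process)
        if prev is not None:
            graph[prev].add(ev)  # consecutive events on one process
        last_on_process[ev.process] = ev
        if ev.kind == "send":
            sends[ev.msg_id] = ev
        elif ev.kind == "recv":
            receives[ev.msg_id] = ev

    # A send directly precedes the receive of the same message.
    for msg_id, recv in receives.items():
        send = sends.get(msg_id)
        if send is not None:
            graph[send].add(recv)
    return graph


def precedes(graph, a, b):
    """True if there is a directed path from event a to event b."""
    stack, seen = [a], set()
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


if __name__ == "__main__":
    history = [
        Event("P1", 0, "send", "m1"),
        Event("P2", 0, "recv", "m1"),
        Event("P2", 1, "send", "m2"),
        Event("P1", 1, "recv", "m2"),
    ]
    g = build_timing_graph(history)
    # The request on P1 is ordered before the reply arriving back on P1.
    print(precedes(g, history[0], history[3]))
```

Two events with no path between them in either direction are unordered by the recorded history. A pair like that is where a programmer would look when hunting for an unacceptable ordering.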