Traces synchronization in distributed networks

Authors:
Eric Clément; Michel Dagenais
Affiliations:
Department of Computer Engineering, École Polytechnique de Montréal, Montreal, QC, Canada; Department of Computer Engineering, École Polytechnique de Montréal, Montreal, QC, Canada
Venue:
Journal of Computer Systems, Networks, and Communications
Year:
2009

Abstract

This article proposes a novel approach for synchronizing, a posteriori, the detailed execution traces collected from several networked computers. It can be used to debug and investigate complex performance problems in systems where several computers exchange information. While the distributed system is under study, detailed execution traces are generated locally on each node using LTTng, an efficient and accurate system-level tracer. Once tracing is finished, the individual traces are collected and analysed together. The messaging events in all the traces are then identified and correlated in order to estimate the time offset between nodes as it varies over time. The imprecision of the offset computation, caused by asymmetric network delays and by operating system latency in sending and receiving messages, is amortized through a linear least-squares fit over several messages covering a large time span. As a result, the clock offsets in a distributed system can be estimated to within the order of a microsecond, even with a relatively low volume of exchanged messages and with a very low impact on system execution. This accuracy is sufficient to properly order the events traced on the individual computers in the distributed system.
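The offset estimation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `fit_line` and `estimate_offset`, the tuple layout of the matched messages, and the choice to fit each direction separately and average the two lines are all assumptions made for illustration. The idea is that a message from A to B yields an apparent offset of (true offset + delay), while a message from B to A yields (true offset - delay), so averaging the two fitted lines cancels the delay when it is roughly symmetric on average.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def fit_line(points: List[Point]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0.0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimate_offset(
    a_to_b: List[Tuple[float, float]],
    b_to_a: List[Tuple[float, float]],
) -> Tuple[float, float]:
    """Estimate (offset0, drift) so that clock_B ~= clock_A + offset0 + drift * clock_A.

    a_to_b: (send time on A's clock, receive time on B's clock) per message.
    b_to_a: (send time on B's clock, receive time on A's clock) per message.
    This layout is hypothetical; a real tracer would supply matched events.
    """
    # Apparent offset for A -> B messages: true offset + one-way delay.
    forward = [(t_a, t_b - t_a) for t_a, t_b in a_to_b]
    # Apparent offset for B -> A messages: true offset - one-way delay.
    backward = [(t_a, t_b - t_a) for t_b, t_a in b_to_a]

    intercept_f, slope_f = fit_line(forward)
    intercept_b, slope_b = fit_line(backward)

    # Averaging the two lines cancels the delay if it is symmetric on average.
    return (intercept_f + intercept_b) / 2.0, (slope_f + slope_b) / 2.0


def to_reference_clock(t_b: float, offset0: float, drift: float) -> float:
    """Map a timestamp from B's clock back onto A's clock (inverse of the model)."""
    return (t_b - offset0) / (1.0 + drift)
```

Once `offset0` and `drift` are known for a pair of nodes, `to_reference_clock` can convert every timestamp in B's trace to A's time base, so that events from both traces can be merged and ordered. Asymmetric delays bias this simple average; the sketch only shows how fitting over many messages spread across a long interval smooths out per-message noise.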