Using Model Checking to Analyze the System Behavior of the LHC Production Grid

Authors:
Daniela Remenska; Tim A. C. Willemse; Kees Verstoep; Wan Fokkink; Jeff Templon; Henri Bal
Venue:
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Year:
2012

Abstract

DIRAC (Distributed Infrastructure with Remote Agent Control) is the grid solution designed to support production activities as well as user data analysis for the Large Hadron Collider "beauty" experiment. It consists of cooperating distributed services and a plethora of lightweight agents that deliver the workload to the grid resources. Services accept requests from agents and running jobs, while agents actively fulfill specific goals. Services maintain database back-ends to store dynamic state information of entities such as jobs, queues, or requests for data transfer. Agents continuously check for changes in the service states and react to them accordingly. The logic of each agent is rather simple; the main source of complexity lies in their cooperation. The agents run concurrently and communicate by using the services' databases as shared memory for synchronizing state transitions. Despite the effort invested in making DIRAC reliable, entities occasionally get into inconsistent states. Tracing and fixing such behaviors is difficult, given the inherent parallelism among the distributed components and the size of the implementation. In this paper we present an analysis of DIRAC with mCRL2, a process algebra with data. We have reverse-engineered two critical and related DIRAC subsystems, and subsequently modeled their behavior with the mCRL2 toolset. This enabled us to easily locate race conditions and livelocks, which were confirmed to occur in the real system. We further formalized and verified several behavioral properties of the two modeled subsystems.
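
To illustrate the kind of problem the abstract describes, the following is a minimal sketch in Python (not mCRL2, and not the authors' model) of exhaustive interleaving exploration for two agents that synchronize through a shared status field. Everything in it is an assumption made for illustration: the two agents, the status names "Waiting", "Matched" and "Rescheduled", and the safety property that a job is claimed at most once are hypothetical and are not taken from DIRAC. Each agent first reads the shared status and then, in a separate step, acts on its possibly stale local copy. A breadth-first search over all interleavings returns a shortest schedule that violates the property.

```python
from collections import deque

# Hypothetical statuses written by agent 0 and agent 1 (not DIRAC's real ones).
WRITES = ("Matched", "Rescheduled")


def step(state, i):
    """Return the state after agent i takes its next step, or None if it is done."""
    db, claims, agents = state
    pc, seen = agents[i]
    if pc == 0:
        # Step 1: read the shared status into a local copy.
        nxt = (1, db)
    elif pc == 1:
        # Step 2: act on the (possibly stale) local copy, not on the database.
        if seen == "Waiting":
            db, claims = WRITES[i], claims + 1
        nxt = (2, seen)
    else:
        return None
    agents = agents[:i] + (nxt,) + agents[i + 1:]
    return (db, claims, agents)


def find_violation():
    """Breadth-first search over all interleavings; return a shortest bad schedule."""
    init = ("Waiting", 0, ((0, None), (0, None)))
    queue = deque([(init, [])])
    visited = {init}
    while queue:
        state, trace = queue.popleft()
        if state[1] > 1:
            # Assumed safety property violated: the job was claimed twice.
            return trace
        for i in range(2):
            nxt = step(state, i)
            if nxt is not None and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, trace + [i]))
    return None


# List of agent indices, in the order their steps were scheduled.
trace = find_violation()
```

The returned schedule has both agents read "Waiting" before either one writes, which is the classic check-then-act race on shared state. A model checker such as the mCRL2 toolset explores state spaces of this kind systematically and at far larger scale than this toy search.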