Multi-scale analysis of large distributed computing systems

Authors:
Lucas Mello Schnorr;Arnaud Legrand;Jean-Marc Vincent
Affiliations:
INRIA MESCAL, CNRS LIG, Grenoble, France;INRIA MESCAL, CNRS LIG, Grenoble, France;INRIA MESCAL, CNRS LIG, Grenoble, France
Venue:
Proceedings of the third international workshop on Large-scale system and application performance
Year:
2011

Abstract

Large-scale distributed systems are composed of many thousands of computing units. Today's examples of such systems are grid, volunteer, and cloud computing platforms. Generally, they are analyzed with monitoring tools that gather resource information such as processor or network utilization, providing high-level statistics and basic resource usage traces. Such approaches are recognized as rather scalable but are unfortunately often insufficient to detect or fully understand unexpected behavior. In this paper, we investigate the use in distributed systems of more detailed tracing techniques, which are commonly used in parallel computing. Finely analyzing the behavior of such systems, which comprise thousands of resources, over several months may seem infeasible. Yet we show that the resulting trace can be analyzed using tools that make it easy to zoom in and out on selected areas of space and time. We use the BOINC volunteer computing system as the basis of this study. Since detailed activity traces of the BOINC clients are not available yet, we rely instead on traces obtained through a BOINC simulator that was developed with the SimGrid toolkit and takes as input real availability trace files from the SETI@home BOINC project. We show that the analysis of such detailed resource utilization traces provides several non-trivial insights about the whole system and enables the discovery of unexpected behavior.
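
The idea of zooming in and out over space and time can be illustrated with a minimal sketch. It is not the authors' tool or trace format: the record layout `(resource_path, start, end, utilization)`, the function name `aggregate`, and the sample site and host names are all assumptions made for illustration. The sketch computes a time-weighted mean utilization for a chosen time slice (zoom in time) and a chosen depth of the resource hierarchy (zoom in space).

```python
from collections import defaultdict

# Hypothetical trace record: (resource_path, start, end, utilization)
# resource_path is a tuple such as ("site_a", "host1"); utilization is in [0, 1].
def aggregate(trace, t0, t1, depth):
    """Time-weighted mean utilization per hierarchy group over [t0, t1).

    depth selects the spatial zoom level: 0 groups the whole platform,
    1 groups by the first path component, and so on.
    Assumption: a resource with no record inside the window counts as idle.
    """
    if t1 <= t0:
        raise ValueError("time slice must have positive length")
    busy = defaultdict(float)    # group -> integral of utilization over time
    members = defaultdict(set)   # group -> distinct resources seen in the trace
    for path, start, end, value in trace:
        group = path[:depth]
        members[group].add(path)
        # Length of the part of this record that falls inside the time slice.
        overlap = min(end, t1) - max(start, t0)
        if overlap > 0:
            busy[group] += value * overlap
    span = t1 - t0
    return {g: busy[g] / (span * len(members[g])) for g in members}


# Illustrative data only, not taken from the paper.
trace = [
    (("site_a", "host1"), 0, 10, 1.0),
    (("site_a", "host2"), 0, 10, 0.0),
    (("site_b", "host3"), 0, 5, 1.0),
    (("site_b", "host3"), 5, 10, 0.5),
]

per_site = aggregate(trace, 0, 10, depth=1)   # zoomed out in space
per_host = aggregate(trace, 0, 5, depth=2)    # zoomed in on space and time
```

Averaging over a coarse group and a long slice hides detail, such as the drop in `host3` after time 5, which reappears once the slice or the depth is narrowed. That trade-off is the reason for moving between scales interactively.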