Detection and analysis of resource usage anomalies in large distributed systems through multi-scale visualization

Authors:
Lucas Mello Schnorr;Arnaud Legrand;Jean-Marc Vincent
Affiliations:
INRIA, CNRS, University of Grenoble, Grenoble, France;INRIA, CNRS, University of Grenoble, Grenoble, France;INRIA, CNRS, University of Grenoble, Grenoble, France
Venue:
Concurrency and Computation: Practice & Experience
Year:
2012

Visualization

Abstract

Understanding the behavior of large-scale distributed systems is generally extremely difficult because it requires observing a very large number of components over very long periods of time. Most analysis tools for distributed systems gather basic information such as individual processor or network utilization. Although they scale thanks to the data reduction techniques applied before the analysis, these tools are often insufficient to detect or fully understand anomalies in the dynamic behavior of resource utilization and their influence on application performance. In this paper, we propose a methodology for detecting resource usage anomalies in large-scale distributed systems. The methodology relies on four functionalities: characterized trace collection, multi-scale data aggregation, specifically tailored user interaction techniques, and visualization techniques. We show the efficiency of this approach through the analysis of simulations of the Berkeley Open Infrastructure for Network Computing (BOINC) volunteer computing architecture. Three scenarios are analyzed in this paper: analysis of the resource sharing mechanism, resource usage when response time is considered instead of throughput, and evaluation of the effect of input file size on the BOINC architecture. The results show that our methodology makes it easy to identify resource usage anomalies, such as unfair resource sharing, contention, moving network bottlenecks, and harmful short-term resource sharing. Copyright © 2011 John Wiley & Sons, Ltd.