Monitoring Large Systems Via Statistical Sampling

Authors:
Celso L. Mendes; Daniel A. Reed
Affiliations:
Department of Computer Science, University of Illinois Urbana, USA; Department of Computer Science, University of Illinois Urbana, USA
Venue:
International Journal of High Performance Computing Applications
Year:
2004

Abstract

As parallel systems scale toward petaflop performance, enabled by advances in circuit density and by an increasingly available computational Grid, the development of efficient mechanisms for monitoring large systems becomes imperative. When computational components are coupled via dynamically shifting connections with various remote resources, the number of potential factors affecting system behavior is enormous. Yet the overhead of monitoring can be prohibitive. In this paper we present a new technique for monitoring large systems based on statistical sampling. Rather than monitoring each component, we select a statistically valid sample and measure the behavior of sample members. We describe the formal requirements of sample selection and verify the feasibility of our approach with experiments on large parallel systems and wide-area networks. Our results show that this technique can be a powerful tool to enable effective monitoring without incurring the large costs typically associated with exhaustive checking.
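
To make the idea of measuring only a statistically valid sample concrete, here is a minimal sketch in Python. It is not the authors' formulation, since the abstract does not reproduce the paper's formal requirements. It assumes a textbook setting: simple random sampling without replacement, a yes/no property per node (for example, "is the node up?"), the normal approximation for a proportion, and a finite population correction. The names `required_sample_size`, `estimate_proportion`, and `is_up` are hypothetical and exist only for this illustration.

```python
import math
import random
from statistics import NormalDist


def required_sample_size(population, margin, confidence=0.95, p=0.5):
    """Sample size needed to estimate a proportion within +/- margin.

    Assumption (not from the paper): normal approximation with a finite
    population correction. p=0.5 is the most conservative choice because
    it maximizes the variance p * (1 - p).
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Sample size for an effectively infinite population.
    n0 = (z * z * p * (1.0 - p)) / (margin * margin)
    # Finite population correction shrinks the sample for small systems.
    n = n0 / (1.0 + (n0 - 1.0) / population)
    return min(population, math.ceil(n))


def estimate_proportion(nodes, is_up, margin, confidence=0.95, rng=None):
    """Probe only a random sample of nodes and estimate the fraction 'up'.

    `is_up` is a hypothetical probe callback standing in for whatever
    measurement a real monitor would perform on one node.
    """
    rng = rng or random.Random()
    n = required_sample_size(len(nodes), margin, confidence)
    sample = rng.sample(nodes, n)  # simple random sample, no replacement
    hits = sum(1 for node in sample if is_up(node))
    return hits / n, n


if __name__ == "__main__":
    # Synthetic system: 10,000 nodes, every tenth node is down.
    nodes = list(range(10_000))
    estimate, probed = estimate_proportion(
        nodes, lambda i: i % 10 != 0, margin=0.05, rng=random.Random(0)
    )
    print(f"probed {probed} of {len(nodes)} nodes, estimated up-fraction {estimate:.3f}")
```

Under these assumptions, a 10,000-node system needs 370 probes for a 5% margin at 95% confidence. The required sample grows very slowly with system size, which is the property that makes sampling attractive compared with exhaustive checking.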