Stratified random sampling for power estimation
Proceedings of the 1996 IEEE/ACM international conference on Computer-aided design
Design error simulation based on error modeling and sampling techniques
IMACS Selected papers from the IMACS European conference on Simulation
The grid: blueprint for a new computing infrastructure
The grid: blueprint for a new computing infrastructure
Estimation of software reliability by stratified sampling
ACM Transactions on Software Engineering and Methodology (TOSEM)
Managing performance analysis with dynamic statistical projection pursuit
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Future Generation Computer Systems - Special issue on metacomputing
Production Job Scheduling for Parallel Shared Memory Systems
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Automatic arima time series modeling and forecasting for adaptive input/output prefetching
Automatic arima time series modeling and forecasting for adaptive input/output prefetching
Reliability challenges in large systems
Future Generation Computer Systems
Tracking in a spaghetti bowl: monitoring transactions using footprints
SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Resource Information Aggregation in Hierarchical Grid Networks
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Reliability challenges in large systems
Future Generation Computer Systems
Clustering performance data efficiently at massive scales
Proceedings of the 24th ACM International Conference on Supercomputing
End-to-end framework for fault management for open source clusters: Ranger
Proceedings of the 2010 TeraGrid Conference
TA UoverSupermon: low-overhead online parallel performance monitoring
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
Performance troubleshooting in data centers: an annotated bibliography?
ACM SIGOPS Operating Systems Review
Hi-index | 0.00 |
As the trend in parallel systems scales toward petaflop performance tapped by advances in circuit density and by an increasingly available computational Grid, the development of efficient mechanisms for monitoring large systems becomes imperative. When computational components are coupled via dynamically shifting connections with various remote resources, the number of potential factors affecting system behavior is enormous. Yet the overhead of monitoring can be prohibitive. In this paper we present a new technique for monitoring large systems based on statistical sampling. Rather than monitoring each component, we select a statistically valid sample and measure the behavior of sample members. We describe the formal requirements of sample selection and verify the feasibility of our approach with experiments on large parallel systems and wide-area networks. Our results show that this technique can be a powerful tool to enable effective monitoring without incurring the large costs typically associated to exhaustive checking.