Service time estimation with a refinement enhanced hybrid clustering algorithm

Authors:
Paolo Cremonesi; Kanika Dhyani; Andrea Sansottera
Affiliations:
Politecnico di Milano, Milan, Italy; Neptuny, s.r.l., Milan, Italy; Neptuny, s.r.l., Milan, Italy
Venue:
ASMTA'10 Proceedings of the 17th international conference on Analytical and stochastic modeling techniques and applications
Year:
2010

Abstract

Inferring service times from workload and utilization data is important for predicting the performance of computer systems. Although the utilization law expresses a linear relationship between the workload submitted to a computing system and its utilization, the automated analysis of real-world datasets is far from trivial. Hardware and software upgrades modify the service time, and periodic activities affect the utilization law. Therefore, multiple regression lines must be found in the datasets to explain the different behaviours of the system. In this paper, we propose a new methodology that works in three main phases: (1) clustering based on the density of points; (2) splitting of clusters and estimation of regression lines, obtained from our extension of a clusterwise regression algorithm; and (3) a refinement procedure that removes and merges clusters. The cumulative effect of these phases is the simultaneous determination of the number of clusters, the point-to-cluster membership, the underlying regression lines and the outliers. A novel feature of our approach is that the selection of the number of clusters exploits the structure of the data and is not based on model complexity, as in most previous methods. A computational comparison of our method with suitable existing approaches, on real-world data as well as on challenging synthetic "realistic" data, shows the efficiency of our algorithm.
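To illustrate why a single regression line may not be enough, the following minimal sketch applies the utilization law (utilization = throughput × service time) to synthetic data from two configurations of the same system. It is not the algorithm proposed in the paper: the regime labels are assumed to be known in advance, whereas the paper's method has to discover them. All names and numeric values (`fit_line`, `make_regime`, the 20 ms and 10 ms service times, the constant background utilization standing in for other activity) are hypothetical and chosen only for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def make_regime(rng, service_time, background, n=200):
    """Synthetic (throughput, utilization) samples for one configuration.

    Assumption: utilization = service_time * throughput + background + noise,
    i.e. the utilization law plus a constant offset and Gaussian noise.
    """
    xs = [rng.uniform(5.0, 40.0) for _ in range(n)]  # requests per second
    ys = [service_time * x + background + rng.gauss(0.0, 0.005) for x in xs]
    return xs, ys


rng = random.Random(0)

# Hypothetical scenario: 20 ms per request before an upgrade, 10 ms after.
x_old, u_old = make_regime(rng, service_time=0.020, background=0.05)
x_new, u_new = make_regime(rng, service_time=0.010, background=0.05)

# Fitting each regime separately recovers its service time as the slope.
s_old, _ = fit_line(x_old, u_old)
s_new, _ = fit_line(x_new, u_new)

# Fitting one line to the pooled data yields a slope that matches neither.
s_mixed, _ = fit_line(x_old + x_new, u_old + u_new)

print(f"old regime:  {s_old * 1000:.2f} ms per request")
print(f"new regime:  {s_new * 1000:.2f} ms per request")
print(f"pooled data: {s_mixed * 1000:.2f} ms per request")
```

In this sketch the pooled slope falls between the two true service times, so it describes neither configuration. This is the situation that motivates finding several regression lines, together with the point-to-cluster membership, directly from the data.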