War of the benchmark means: time for a truce

Authors:
John R. Mashey
Affiliations:
Techviser
Venue:
ACM SIGARCH Computer Architecture News
Year:
2004

Abstract

For decades, computer benchmarkers have fought a War of Means. Although many have raised concerns about the geometric mean (GM), it continues to be used by SPEC and others. This war is an unnecessary misunderstanding due to inadequately articulated implicit assumptions, plus confusion among populations, their parameters, sampling methods, and sample statistics. In fact, all the Means have their uses, sometimes in combination. Metrics may be algebraically correct but statistically irrelevant or misleading if applied to population distributions for which they are inappropriate. Normal (Gaussian) distributions are so useful that they are often assumed without question, but many important distributions are not normal. They require different analyses, most commonly by finding a mathematical transformation that yields a normal distribution, computing the metrics, and then back-transforming to the original scale. Consider the distribution of relative performance ratios of programs on two computers. The normal distribution is a good fit only when variance and skew are small; otherwise it generates logical impossibilities and misleading statistical measures. A much better choice is the lognormal (or log-normal) distribution, not just on theoretical grounds, but through the (necessary) validation with real data. Normal and lognormal distributions are similar for low variance and skew, but the lognormal handles skewed distributions reasonably, unlike the normal. Lognormal distributions occur frequently elsewhere, are well understood, and have standard methods of analysis. Everyone agrees that "Performance is not a single number," ... and then argues about which number is better. It is more important to understand populations, appropriate methods, and proper ways to convey uncertainty.
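The transform, compute, back-transform procedure described above can be sketched as follows. This is a minimal illustration, not the paper's own code: the function name `lognormal_summary` and the sample ratios are hypothetical, and the sketch assumes the ratios are strictly positive and roughly lognormal. It also shows the kind of logical impossibility a normal model can produce: a "mean minus two standard deviations" bound that falls below zero for a quantity that cannot be negative.

```python
import math
import statistics


def lognormal_summary(ratios):
    """Summarize positive performance ratios on the log scale, then back-transform.

    Returns (gm, gsd): the geometric mean and the multiplicative
    (geometric) standard deviation of the sample.
    """
    if len(ratios) < 2 or any(r <= 0 for r in ratios):
        raise ValueError("need at least two strictly positive ratios")
    logs = [math.log(r) for r in ratios]   # transform to the log scale
    mu = statistics.mean(logs)             # sample mean of the logs
    sigma = statistics.stdev(logs)         # sample standard deviation of the logs
    gm = math.exp(mu)                      # back-transform: geometric mean
    gsd = math.exp(sigma)                  # back-transform: multiplicative std dev
    return gm, gsd


# Hypothetical, deliberately skewed ratios (machine A time / machine B time).
ratios = [0.5, 1.0, 2.0, 4.0, 8.0]

gm, gsd = lognormal_summary(ratios)

# Normal model: an additive lower bound, which can go negative.
normal_low = statistics.mean(ratios) - 2 * statistics.stdev(ratios)

# Lognormal model: a multiplicative lower bound, which is always positive.
lognormal_low = gm / gsd ** 2

print(f"GM = {gm:.3f}, multiplicative SD = {gsd:.3f}")
print(f"normal lower bound    = {normal_low:.3f}")
print(f"lognormal lower bound = {lognormal_low:.3f}")
```

For this sample the additive bound is negative, which is meaningless for a performance ratio, while the multiplicative bound stays positive by construction.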
When population parameters are estimated via samples, statistically correct methods must be used to produce the appropriate means, measures of dispersion, skew, confidence levels, and perhaps goodness-of-fit estimators. If the wrong Mean is chosen, it is difficult to achieve much. The GM predicts the mean relative performance of programs, not of workloads. The usual GM formula is rather unintuitive, and is often claimed to have no physical meaning. However, it is the back-transformed average of a lognormal distribution, as can be seen by the mathematical identity below. Its use is not only statistically appropriate in some cases, but enables straightforward computation of other useful statistics.

"If a man will begin in certainties, he shall end in doubts, but if he will be content to begin with doubts, he shall end with certainties." - Francis Bacon, in Savage.
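The identity mentioned in the abstract is not reproduced in this listing. The standard identity relating the usual GM formula to the back-transformed mean of the logarithms is given here as a reconstruction, not a quotation from the paper; the symbols r_i (strictly positive performance ratios) and n (their count) are assumed notation:

```latex
\[
\mathrm{GM}(r_1,\dots,r_n)
  = \left(\prod_{i=1}^{n} r_i\right)^{1/n}
  = \exp\!\left(\frac{1}{n}\sum_{i=1}^{n}\ln r_i\right),
  \qquad r_i > 0 .
\]
```

Read right to left: take logarithms of the ratios, average them, and exponentiate the result.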