Measuring computer performance: a practitioner's guide
Online feedback-directed optimization of Java. OOPSLA '02: Proceedings of the 17th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications.
Efficiently Evaluating Speedup Using Sampled Processor Simulation. IEEE Computer Architecture Letters.
The DaCapo benchmarks: Java benchmarking development and analysis. Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications.
Replay compilation: improving debuggability of a just-in-time compiler. Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications.
Statistically rigorous Java performance evaluation. Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications.
Java performance evaluation through rigorous replay compilation. Proceedings of the 23rd ACM SIGPLAN conference on Object-oriented programming systems languages and applications.
Producing wrong data without doing anything obviously wrong! Proceedings of the 14th international conference on Architectural support for programming languages and operating systems.
Precise regression benchmarking with random effects: improving Mono benchmark results. EPEW '06: Proceedings of the Third European conference on Formal Methods and Stochastic Models for Performance Evaluation.
STABILIZER: statistically sound performance evaluation. Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems.
Exploiting slicing and patterns for RTSJ immortal memory optimization. Proceedings of the 2013 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools.
Experimental evaluation is key to systems research. Because modern systems are complex and non-deterministic, good experimental methodology demands that researchers account for uncertainty. To obtain valid results, they are expected to run many iterations of benchmarks, invoke virtual machines (VMs) several times, or even rebuild VM or benchmark binaries more than once. All this repetition makes experiments costly in time. Currently, many evaluations give up on sufficient repetition or rigorous statistical methods, or even run benchmarks only on training-size inputs. Reported results often lack proper estimates of variation and, when a small difference between two systems is claimed, some are simply unreliable. In contrast, we provide a statistically rigorous methodology for repetition and summarising results that makes efficient use of experimentation time. Time efficiency comes from two key observations. First, a given benchmark on a given platform typically exhibits much less non-determinism than the worst cases highlighted in published corner-case studies. Second, repetition is most needed where most uncertainty arises (whether between builds, between executions or between iterations). We capture experimentation cost with a novel mathematical model, which we use to identify, at each level of an experiment, the number of repetitions that is necessary and sufficient to obtain a given level of precision. We present our methodology as a cookbook that guides researchers on how many repetitions to run to obtain reliable results. We also show how to present results with an effect-size confidence interval. As an example, we show how to use our methodology to conduct throughput experiments with the DaCapo and SPEC CPU benchmarks on three recent platforms.
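To make the two reporting steps in the abstract concrete (dimensioning repetitions where uncertainty arises, then summarising with an effect-size confidence interval), the Python sketch below walks through a hypothetical two-level experiment: VM executions at the top level, benchmark iterations within each execution. It is illustrative only and not the paper's exact formulation: the function names and numbers are invented, the allocation rule is the classic optimal-allocation result for nested designs used as a stand-in for the paper's cost model, and the interval is a bootstrap percentile interval rather than a parametric construction.

```python
# Illustrative sketch only: a hypothetical two-level experiment
# (VM executions containing benchmark iterations).
import math
import random
import statistics


def iterations_per_execution(cost_execution, cost_iteration,
                             var_between_executions, var_within_execution):
    """Classic optimal-allocation rule for a two-level nested design:
    repeat most at the level that is cheap and noisy.  Used here as a
    stand-in for the paper's cost model."""
    n = math.sqrt((cost_execution * var_within_execution) /
                  (cost_iteration * var_between_executions))
    return max(1, math.ceil(n))


def execution_means(runs):
    """Collapse each execution (a list of per-iteration times) to its mean."""
    return [statistics.mean(iterations) for iterations in runs]


def ratio_confidence_interval(runs_a, runs_b, confidence=0.95, resamples=10_000):
    """Bootstrap percentile interval for mean(A) / mean(B), resampling whole
    executions so between-execution variation is not ignored.  A simplified
    alternative to a parametric effect-size interval."""
    means_a, means_b = execution_means(runs_a), execution_means(runs_b)
    ratios = sorted(
        statistics.mean(random.choices(means_a, k=len(means_a))) /
        statistics.mean(random.choices(means_b, k=len(means_b)))
        for _ in range(resamples)
    )
    lower = ratios[int(resamples * (1 - confidence) / 2)]
    upper = ratios[int(resamples * (1 + confidence) / 2) - 1]
    return lower, upper


if __name__ == "__main__":
    # Dimensioning: suppose starting a fresh VM costs about 200x one benchmark
    # iteration and most variance sits within executions (hypothetical numbers).
    print(iterations_per_execution(cost_execution=20.0, cost_iteration=0.1,
                                   var_between_executions=0.5,
                                   var_within_execution=2.0))

    # Reporting: per-iteration times from several executions of systems A and B.
    a = [[10.1, 10.3, 9.9], [10.4, 10.2, 10.0], [10.0, 10.1, 10.2]]
    b = [[9.6, 9.8, 9.7], [9.9, 9.7, 9.5], [9.8, 9.6, 9.7]]
    print(ratio_confidence_interval(a, b))
```

Resampling whole executions (the top level) rather than individual iterations keeps between-execution variation inside the interval, mirroring the abstract's point that repetition, and hence the uncertainty estimate, must follow the level where the variation actually arises.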