Producing wrong data without doing anything obviously wrong!

Authors:
Todd Mytkowicz;Amer Diwan;Matthias Hauswirth;Peter F. Sweeney
Affiliations:
University of Colorado, Boulder, CO, USA;University of Colorado, Boulder, CO, USA;University of Lugano, Lugano, Switzerland;IBM Research, Hawthorne, NY, USA
Venue:
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Year:
2009

Abstract

This paper presents a surprising result: changing a seemingly innocuous aspect of an experimental setup can cause a systems researcher to draw wrong conclusions from an experiment. What appears to be an innocuous aspect in the experimental setup may in fact introduce a significant bias in an evaluation. This phenomenon is called measurement bias in the natural and social sciences. Our results demonstrate that measurement bias is significant and commonplace in computer system evaluation. By significant we mean that measurement bias can lead to a performance analysis that either over-states an effect or even yields an incorrect conclusion. By commonplace we mean that measurement bias occurs in all architectures that we tried (Pentium 4, Core 2, and m5 O3CPU), both compilers that we tried (gcc and Intel's C compiler), and most of the SPEC CPU2006 C programs. Thus, we cannot ignore measurement bias. Nevertheless, in a literature survey of 133 recent papers from ASPLOS, PACT, PLDI, and CGO, we determined that none of the papers with experimental results adequately consider measurement bias. Inspired by similar problems and their solutions in other sciences, we describe and demonstrate two methods, one for detecting (causal analysis) and one for avoiding (setup randomization) measurement bias.
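The setup randomization idea mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: it assumes the randomized aspect of the setup is the size of the process environment, and the names `random_setups`, `time_command`, `compare`, `summarize`, and the `SETUP_PADDING` variable are hypothetical. The interval uses a plain normal approximation, which is only a rough guide for small samples.

```python
import os
import random
import statistics
import subprocess
import sys
import time


def random_setups(count, max_padding=4096, seed=0):
    """Return `count` environment-padding sizes (bytes) drawn uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(max_padding + 1) for _ in range(count)]


def time_command(argv, padding):
    """Wall-clock seconds for one run of argv with `padding` extra environment bytes."""
    # SETUP_PADDING is a hypothetical variable; it only changes the environment size.
    env = dict(os.environ, SETUP_PADDING="x" * padding)
    start = time.perf_counter()
    subprocess.run(argv, env=env, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def compare(baseline_argv, optimized_argv, setups):
    """Speedup of optimized over baseline, measured once per randomized setup."""
    return [time_command(baseline_argv, p) / time_command(optimized_argv, p)
            for p in setups]


def summarize(speedups, z=1.96):
    """Mean speedup and an approximate 95% interval (normal approximation)."""
    mean = statistics.mean(speedups)
    half_width = z * statistics.stdev(speedups) / len(speedups) ** 0.5
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    # Placeholder workloads; substitute the two real binaries being compared.
    base = [sys.executable, "-c", "sum(range(200000))"]
    opt = [sys.executable, "-c", "sum(range(100000))"]
    mean, low, high = summarize(compare(base, opt, random_setups(10)))
    print(f"speedup {mean:.3f} (95% interval {low:.3f} to {high:.3f})")
```

Reporting the interval over many randomized setups, rather than a single number from one fixed setup, makes it visible when an apparent speedup is within the spread caused by the setup itself.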