Truth in SPEC benchmarks

Authors:
Nikki Mirghafori;Margret Jacoby;David Patterson
Affiliations:
University of California at Berkeley, Berkeley, CA;University of California at Berkeley, Berkeley, CA;University of California at Berkeley, Berkeley, CA
Venue:
ACM SIGARCH Computer Architecture News
Year:
1995



Abstract

The System Performance Evaluation Cooperative (SPEC) benchmarks are a set of integer and floating-point programs that are intended to be “effective and fair in comparing the performance of high performance computing systems”. SPEC ratings are often quoted in company advertising and have been trusted as the de facto measure for comparing computer systems. Recently, concern has arisen about the fairness and value of these benchmarks for that purpose. In this paper we investigate two questions about the SPEC92 benchmark suite: 1) How sensitive are the SPEC ratings to various tunings? 2) How reproducible are the published results? For six vendors, we compare the published SPECpeak and SPECbase ratings and observe an 11% average improvement in the SPECpeak ratings due to changes in the compiler flags alone. In our own attempt to reproduce the published SPEC ratings, we came across various “explicit” and “hidden” tuning parameters that we consider unrealistic. We propose a new metric, SPECsimple, that requires using only the -O compiler optimization flag, shared libraries, and a standard system configuration. SPECsimple is designed to better match the performance experienced by a typical user. Our measured SPECsimple ratings are 65-86% of the advertised SPECpeak performance. We conclude by citing cases of compiler optimizations designed specifically for SPEC programs, in which performance decreases drastically or the computed results are incorrect if the compiled program does not exactly match the SPEC benchmark program. These findings show that the fairness and value of the popular SPEC benchmarks are questionable.
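The comparisons quoted in the abstract (the gain of SPECpeak over SPECbase, and SPECsimple as a fraction of SPECpeak) are simple arithmetic on suite ratings. The sketch below is illustrative only and is not taken from the paper. It assumes the usual SPEC92 convention that a suite rating is the geometric mean of per-benchmark SPECratios (reference time divided by measured time). The function names and all of the numbers are hypothetical.

```python
import math


def spec_rating(ratios):
    """Geometric mean of per-benchmark SPECratios (assumed SPEC92 convention)."""
    if not ratios:
        raise ValueError("need at least one SPECratio")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def percent_gain(peak, base):
    """Percentage improvement of the peak rating over the base rating."""
    return 100.0 * (peak - base) / base


def fraction_of_peak(simple, peak):
    """A SPECsimple-style rating expressed as a fraction of the peak rating."""
    return simple / peak


# Hypothetical SPECratios for one machine under three build configurations.
# These are NOT measurements from the paper.
base_ratios = [40.0, 50.0, 62.5]      # conservative, uniform flags
peak_ratios = [44.0, 55.0, 68.75]     # per-benchmark tuned flags
simple_ratios = [35.2, 44.0, 55.0]    # -O only, shared libraries

base = spec_rating(base_ratios)
peak = spec_rating(peak_ratios)
simple = spec_rating(simple_ratios)

print(f"base={base:.1f} peak={peak:.1f} simple={simple:.1f}")
print(f"peak over base: {percent_gain(peak, base):.1f}%")
print(f"simple as fraction of peak: {fraction_of_peak(simple, peak):.2f}")
```

Because the geometric mean is multiplicative, scaling every SPECratio by the same factor scales the suite rating by that factor. Uniform tuning gains therefore carry through to the headline number directly.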