Variability in Architectural Simulations of Multi-Threaded Workloads

Authors:
Alaa R. Alameldeen; David A. Wood
Venue:
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Year:
2003

Abstract

Multi-threaded commercial workloads implement many important internet services. Consequently, these workloads are increasingly used to evaluate the performance of uniprocessor and multiprocessor system designs. This paper identifies performance variability as a potentially major challenge for architectural simulation studies using these workloads. Variability refers to the differences between multiple estimates of a workload's performance. Time variability occurs when a workload exhibits different characteristics during different phases of a single run. Space variability occurs when small variations in timing cause runs starting from the same initial condition to follow widely different execution paths.

Variability is a well-known phenomenon in real systems, but is nearly universally ignored in simulation experiments. In a central result of this paper, we show that variability in multi-threaded commercial workloads can lead to incorrect architectural conclusions (e.g., 31% of the time in one experiment). We propose a methodology, based on multiple simulations and standard statistical techniques, to compensate for variability. Our methodology greatly reduces the probability of reaching incorrect conclusions, while enabling simulations to finish within reasonable time limits.
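The abstract does not spell out which statistical techniques the methodology uses, so the following is only an illustrative sketch of the general idea, not the authors' procedure. It assumes several simulation runs per configuration, a lower-is-better metric, and a normal-approximation confidence interval. The function names and the sample numbers are hypothetical, invented purely for illustration.

```python
from itertools import product
from statistics import NormalDist, mean, stdev


def wrong_conclusion_ratio(runs_a, runs_b):
    """Fraction of single-run pairs (a, b) whose ordering contradicts
    the ordering of the two sample means (lower is assumed better).
    Ties between a and b are counted as contradicting "A is better"."""
    a_better_on_average = mean(runs_a) < mean(runs_b)
    pairs = list(product(runs_a, runs_b))
    wrong = sum(1 for a, b in pairs if (a < b) != a_better_on_average)
    return wrong / len(pairs)


def confidence_interval(samples, confidence=0.95):
    """Normal-approximation interval for the mean. This is an assumption
    made to stay within the standard library; for small samples a
    t-distribution would be the more appropriate choice."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(samples) / len(samples) ** 0.5
    center = mean(samples)
    return (center - half_width, center + half_width)


def intervals_overlap(ci_a, ci_b):
    """True when the two intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Hypothetical cycles-per-transaction from five runs of each configuration.
config_a = [100.0, 104.0, 98.0, 110.0, 101.0]
config_b = [103.0, 99.0, 107.0, 102.0, 105.0]

if __name__ == "__main__":
    print("wrong-conclusion ratio:", wrong_conclusion_ratio(config_a, config_b))
    ci_a = confidence_interval(config_a)
    ci_b = confidence_interval(config_b)
    print("95% CI for A:", ci_a)
    print("95% CI for B:", ci_b)
    print("intervals overlap:", intervals_overlap(ci_a, ci_b))
```

With these made-up numbers, a comparison based on one run per configuration disagrees with the comparison of means for a sizable share of run pairs, and the two intervals overlap. That overlap is the cue, under this sketch's assumptions, to gather more runs before claiming that one design outperforms the other.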