NSL-BLRL: Efficient Cache Warmup for Sampled Processor Simulation

Authors:
Luk Van Ertvelde;Filip Hellebaut;Lieven Eeckhout;Koen De Bosschere
Affiliations:
Ghent University, Belgium;Ghent University, Belgium;Ghent University, Belgium;Ghent University, Belgium
Venue:
ANSS '06 Proceedings of the 39th annual Symposium on Simulation
Year:
2006

Abstract

Architectural simulation is extremely time-consuming given the huge number of instructions that need to be simulated for contemporary benchmarks. Sampled simulation, which selects a number of samples from the complete benchmark execution, yields substantial speedups. However, one major issue must be dealt with in order to minimize non-sampling bias, namely the hardware state at the beginning of each sample. This is well known in the literature as the cold-start problem. The hardware structures that suffer the most from the cold-start problem are cache hierarchies. In this paper we propose NSL-BLRL, which combines two previously proposed cache hierarchy warmup approaches, namely No-State-Loss (NSL) and Boundary Line Reuse Latency (BLRL). The idea of NSL-BLRL is to warm up the cache hierarchy using a hardware state checkpoint that stores a truncated NSL stream. The NSL stream is a least-recently-used stream of (unique) memory references in the pre-sample. This NSL stream is then truncated to form the NSL-BLRL warmup checkpoint; this is done by inspecting the sample to determine how far back in the pre-sample one needs to go to accurately warm up the hardware state for the given sample. We show using SPEC CPU2000 benchmarks that NSL-BLRL (i) is nearly as accurate as BLRL and NSL for sampled processor simulation, (ii) yields simulation time speedups of several orders of magnitude compared to BLRL, and (iii) is more space-efficient than NSL. As such, we conclude that NSL-BLRL is a highly efficient and accurate cache warmup strategy for sampled processor simulation.
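The checkpoint construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `coverage` parameter, and the choice to measure distance in memory references (rather than instructions) are all assumptions made for illustration. The sketch builds the NSL stream as the last occurrence of each unique address in the pre-sample, estimates from the sample how far back its boundary-crossing reuses reach, and keeps only the part of the NSL stream inside that window.

```python
from math import ceil


def nsl_stream(pre_sample):
    """Last occurrence of each unique address, ordered from least to most recently used.

    Returns a list of (position, address) pairs (hypothetical representation).
    """
    last_pos = {}
    for pos, addr in enumerate(pre_sample):
        last_pos[addr] = pos  # later occurrences overwrite earlier ones
    return sorted((pos, addr) for addr, pos in last_pos.items())


def warmup_distance(pre_sample, sample, coverage=0.9):
    """How many pre-sample references before the boundary cover the requested
    fraction of the sample's boundary-crossing reuses (assumed simplification)."""
    last_pos = {addr: pos for pos, addr in enumerate(pre_sample)}
    boundary = len(pre_sample)
    seen = set()
    distances = []
    for addr in sample:
        if addr in seen:
            continue  # reuse inside the sample: does not cross the boundary
        seen.add(addr)
        if addr in last_pos:
            distances.append(boundary - last_pos[addr])
    if not distances:
        return 0
    distances.sort()
    k = ceil(coverage * len(distances))
    return distances[k - 1]


def nsl_blrl_checkpoint(pre_sample, sample, coverage=0.9):
    """Truncate the NSL stream to the window the sample is estimated to need."""
    cutoff = len(pre_sample) - warmup_distance(pre_sample, sample, coverage)
    return [addr for pos, addr in nsl_stream(pre_sample) if pos >= cutoff]


# Toy addresses standing in for cache-line addresses.
A, B, C, D, E, F = 0x10, 0x20, 0x30, 0x40, 0x50, 0x60
pre = [A, B, A, C, D, B, E]
smp = [E, B, E, F, C]
full = nsl_blrl_checkpoint(pre, smp, coverage=1.0)
short = nsl_blrl_checkpoint(pre, smp, coverage=0.6)
```

In this toy run, full coverage keeps four of the five NSL entries and drops the one address the sample never touches, while a lower coverage keeps only the two most recently used entries. Replaying the checkpoint in order, oldest entry first, would leave an LRU cache with the most recently used lines in their most-recent positions.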