A model for estimating trace-sample miss ratios
SIGMETRICS '91 Proceedings of the 1991 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Combining Trace Sampling with Single Pass Methods for Efficient Cache Simulation
IEEE Transactions on Computers
Automatically characterizing large scale program behavior
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
A Comparison of Trace-Sampling Techniques for Multi-Megabyte Caches
IEEE Transactions on Computers
Reducing State Loss For Effective Trace Sampling of Superscalar Processors
ICCD '96 Proceedings of the 1996 International Conference on Computer Design, VLSI in Computers and Processors
Accuracy and Speedup of Parallel Trace-Driven Architectural Simulation
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling
Proceedings of the 30th annual international symposium on Computer architecture
Minimal Subset Evaluation: Rapid Warm-Up for Simulated Hardware State
ICCD '01 Proceedings of the International Conference on Computer Design: VLSI in Computers & Processors
Picking Statistically Valid and Early Simulation Points
Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
TurboSMARTS: accurate microarchitecture simulation sampling in minutes
SIGMETRICS '05 Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
BLRL: Accurate and Efficient Warmup for Sampled Processor Simulation
The Computer Journal
Memory reference reuse latency: Accelerated warmup for sampled microarchitecture simulation
ISPASS '03 Proceedings of the 2003 IEEE International Symposium on Performance Analysis of Systems and Software
Yet shorter warmup by combining no-state-loss and MRRL for sampled LRU cache simulation
Journal of Systems and Software - Special issue: Quality software
Accelerating Multiprocessor Simulation with a Memory Timestamp Record
ISPASS '05 Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005
Efficient sampling startup for sampled processor simulation
HiPEAC'05 Proceedings of the First international conference on High Performance Embedded Architectures and Compilers
Branch Predictor Warmup for Sampled Simulation through Branch History Matching
Transactions on High-Performance Embedded Architectures and Compilers II
Branch history matching: branch predictor warmup for sampled simulation
HiPEAC'07 Proceedings of the 2nd international conference on High performance embedded architectures and compilers
Workload generation for microprocessor performance evaluation: SPEC PhD award (invited abstract)
ICPE '12 Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering
Hi-index | 0.00 |
Architectural simulation is extremely time-consuming given the huge number of instructions that need to be simulated for contemporary benchmarks. Sampled simulation which selects a number of samples from the complete benchmark execution yields substantial speedups. However, there is one major issue that needs to be dealt with in order to minimize non-sampling bias, namely the hardware state at the beginning of each sample. This is well known in the literature as the cold-start problem. The hardware structures that suffer the most from the cold-start problem are cache hierarchies. In this paper we propose NSL-BLRL which combines two previously proposed cache hierarchy warmup approaches, namely No-State-Loss (NSL) and Boundary Line Reuse Latency (BLRL). The idea of NSL-BLRL is to warmup the cache hierarchy using a hardware state checkpoint that stores a truncated NSL stream. The NSL stream is a leastrecently used stream of (unique) memory references in the pre-sample. This NSL stream is then truncated to form the NSL-BLRL warmup checkpoint; this is done by inspecting the sample for determining how far in the pre-sample one needs to go back to accurately warmup the hardware state for the given sample. We show using SPEC CPU2000 benchmarks that NSL-BLRL is (i) nearly as accurate as BLRL and NSL for sampled processor simulation, (ii) yields simulation time speedups of several orders of magnitude compared to BLRL, and (iii) is more space-efficient than NSL. As such, we conclude that NSL-BLRL is a highly efficient and accurate cache warmup strategy for sampled processor simulation.