Memory reference reuse latency: Accelerated warmup for sampled microarchitecture simulation

Authors:
R. S. Burugula; K. Skadron
Affiliations:
Center for Comput. Sci., Bowie, MD, USA; IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
Venue:
ISPASS '03 Proceedings of the 2003 IEEE International Symposium on Performance Analysis of Systems and Software
Year:
2003

Abstract

This paper proposes to speed up sampled microprocessor simulations by reducing warmup times without sacrificing simulation accuracy. It exploits the observation that, of the memory references that precede a sample cluster, those nearest to the cluster are the most likely to be germane to the execution of the cluster itself. Hence, while modeling all cache and branch predictor interactions that precede a sample cluster would reliably establish their state, doing so is overkill and leads to long-running simulations. Instead, simulated cache and branch predictor state can be established accurately and quickly by modeling only a subset of the memory references and control-flow instructions immediately preceding a sample cluster. Our technique measures memory reference reuse latencies (MRRLs), the number of completed instructions between consecutive references to each unique memory location, and uses these data to choose a point prior to each cluster at which to engage cache hierarchy and branch predictor modeling. By starting cache and branch predictor modeling late in the pre-cluster instruction stream, we reduced overall simulation running times by an average of 90.62% of the maximum potential speedup (the speedup obtained by performing no pre-cluster warmup at all), while incurring an average error in IPC of less than 1%, both measured against simulations that warm up all pre-cluster cache and branch predictor interactions.
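
The MRRL measurement described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The trace format (a list of `(instruction_count, address)` pairs covering the pre-cluster and cluster region), the function names, and the rule of picking the warmup window as a percentile of the observed reuse latencies (the `coverage` parameter) are all assumptions made for illustration; the abstract says only that MRRL data are used to choose the point at which modeling begins.

```python
import math


def reuse_latencies(trace):
    """Return the reuse latency of every repeated memory reference.

    `trace` is a hypothetical format: (instruction_count, address) pairs in
    program order. A latency is the number of instructions completed between
    two consecutive references to the same address.
    """
    last_seen = {}   # address -> instruction count of its previous reference
    latencies = []
    for instr_count, addr in trace:
        if addr in last_seen:
            latencies.append(instr_count - last_seen[addr])
        last_seen[addr] = instr_count
    return latencies


def warmup_start(trace, cluster_start, coverage=0.95):
    """Pick the instruction at which to begin cache/predictor modeling.

    Assumed selection rule (not stated in the abstract): use the smallest
    window that covers `coverage` of the observed reuse latencies, and start
    warmup that many instructions before the cluster.
    """
    lats = sorted(reuse_latencies(trace))
    if not lats:
        return cluster_start              # no reuse observed: nothing to warm
    k = max(math.ceil(coverage * len(lats)) - 1, 0)
    window = lats[k]
    return max(0, cluster_start - window)  # never start before the trace


if __name__ == "__main__":
    demo = [(0, 0xA), (10, 0xB), (20, 0xA), (25, 0xB),
            (30, 0xC), (32, 0xC), (40, 0xA)]
    print(reuse_latencies(demo))
    print(warmup_start(demo, cluster_start=100, coverage=0.5))
```

Under this assumed rule, a higher `coverage` value lengthens the warmup window, trading simulation time for more fully established cache and branch predictor state.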