We describe the design of a parallel trace-driven cache simulator for evaluating different cache structures. As research goes deeper, traditional simulation methods, which can only execute simulation operations sequentially, are no longer practical due to their long simulation cycles. An obvious way to achieve fast parallel simulation is to simulate the independent sets of a cache concurrently on different compute resources. We consider the use of a general-purpose GPU to accelerate cache simulation, exploiting set partitioning as the main source of parallelism. However, we show that this technique is inefficient when only a single cache configuration is simulated, because of the high correlation of activity between different sets. We therefore develop trace-sorting and single-pass multi-configuration simulation techniques, taking advantage of the full programmability offered by the Compute Unified Device Architecture (CUDA) on the GPU. Our experimental results demonstrate that the cache simulator on the GPU-CPU platform achieves a 2.44x performance improvement over the traditional sequential algorithm.
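The set-partitioning idea rests on the fact that accesses mapping to different cache sets never interact, so the trace can be split by set index and each partition simulated independently (e.g., one partition per GPU thread). The following Python sketch illustrates this for an LRU set-associative cache; all function names, parameters, and the LRU policy are illustrative assumptions, not the paper's actual implementation:

```python
# Sketch of set-partitioned trace-driven cache simulation.
# Assumes an LRU set-associative cache; all names are illustrative.
from collections import OrderedDict, defaultdict

def simulate_set(accesses, associativity):
    """Simulate one cache set with LRU replacement; return hit count."""
    lru = OrderedDict()  # tag -> None, ordered from least to most recent
    hits = 0
    for tag in accesses:
        if tag in lru:
            hits += 1
            lru.move_to_end(tag)       # mark as most recently used
        else:
            if len(lru) >= associativity:
                lru.popitem(last=False)  # evict least recently used
            lru[tag] = None
    return hits

def simulate_cache(trace, num_sets, associativity, block_size):
    # Partition the trace by set index ("trace sort"); each partition
    # is independent, so each could run on a separate compute resource.
    partitions = defaultdict(list)
    for addr in trace:
        block = addr // block_size
        partitions[block % num_sets].append(block // num_sets)  # tag
    hits = sum(simulate_set(a, associativity) for a in partitions.values())
    return hits, len(trace) - hits
```

Note that when only one configuration is simulated, hot sets receive far more accesses than cold ones, so the partitions are badly imbalanced; this is the load-imbalance problem the abstract attributes to correlated set activity.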