Prefetching using Markov predictors
Proceedings of the 24th annual international symposium on Computer architecture
An analysis of database workload performance on simultaneous multithreaded processors
Proceedings of the 25th annual international symposium on Computer architecture
Exploiting spatial locality in data caches using spatial footprints
Proceedings of the 25th annual international symposium on Computer architecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Predictor-directed stream buffers
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Dynamic hot data stream prefetching for general-purpose programs
PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Using a user-level memory thread for correlation prefetching
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
DBMSs on a Modern Processor: Where Does Time Go?
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling
Proceedings of the 30th annual international symposium on Computer architecture
Scaling and Charact rizing Database Workloads: Bridging the Gap between Research and Practice
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Temporal Streaming of Shared Memory
Proceedings of the 32nd annual international symposium on Computer Architecture
Data Cache Prefetching Using a Global History Buffer
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
Accurate and Complexity-Effective Spatial Pattern Prediction
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
DBmbench: fast and accurate database workload representation on modern microarchitecture
CASCON '05 Proceedings of the 2005 conference of the Centre for Advanced Studies on Collaborative research
A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework
Proceedings of the International Symposium on Code Generation and Optimization
Proceedings of the 33rd annual international symposium on Computer Architecture
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Memory Prefetching Using Adaptive Stream Detection
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Mechanisms for store-wait-free multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
Accelerating and Adapting Precomputation Threads for Effcient Prefetching
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Temporal instruction fetch streaming
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Stream chaining: exploiting multiple levels of correlation in data prefetching
Proceedings of the 36th annual international symposium on Computer architecture
Identifying hierarchical structure in sequences: a linear-time algorithm
Journal of Artificial Intelligence Research
Stream chaining: exploiting multiple levels of correlation in data prefetching
Proceedings of the 36th annual international symposium on Computer architecture
Machine learning-based prefetch optimization for data center applications
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Timing local streams: improving timeliness in data prefetching
Proceedings of the 24th ACM International Conference on Supercomputing
Data marshaling for multi-core architectures
Proceedings of the 37th annual international symposium on Computer architecture
A memory interface for multi-purpose multi-stream accelerators
CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Data-oriented transaction execution
Proceedings of the VLDB Endowment
PLP: page latch-free shared-everything OLTP
Proceedings of the VLDB Endowment
Global-aware and multi-order context-based prefetching for high-performance processors
International Journal of High Performance Computing Applications
ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Unified memory optimizing architecture: memory subsystem control with a unified predictor
Proceedings of the 26th ACM international conference on Supercomputing
Scalable and dynamically balanced shared-everything OLTP with physiological partitioning
The VLDB Journal — The International Journal on Very Large Data Bases
Linearizing irregular memory accesses for improved correlated prefetching
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
SHIFT: shared history instruction fetch for lean-core server processors
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Large-reach memory management unit caches
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
Recent research advocates memory streaming techniques to alleviate the performance bottleneck caused by the high latencies of off-chip memory accesses. Temporal memory streaming replays previously observed miss sequences to eliminate long chains of dependent misses. Spatial memory streaming predicts repetitive data layout patterns within fixed-size memory regions. Because each technique targets a different subset of misses, their effectiveness varies across workloads and each leaves a significant fraction of misses unpredicted. In this paper, we propose Spatio-Temporal Memory Streaming (STeMS) to exploit the synergy between spatial and temporal streaming. We observe that the order of spatial accesses repeats both within and across regions. STeMS records and replays the temporal sequence of region accesses and uses spatial relationships within each region to dynamically reconstruct a predicted total miss order. Using trace-driven and cycle-accurate simulation across a suite of commercial workloads, we demonstrate that with similar implementation complexity as temporal streaming, STeMS achieves equal or higher coverage than spatial or temporal memory streaming alone, and improves performance by 31%, 3%, and 18% over stride, spatial, and temporal prediction, respectively.