Data prefetching, advanced cache replacement policies, and memory access scheduling are all incorporated in modern processors. Typically, each technique independently records recently accessed locations and controls the memory subsystem based on its own prediction of future memory accesses. Unfortunately, such separate optimizations often increase implementation cost, degrade system performance, and limit the scalability of the processor chip. In this paper, we propose the Unified Memory Optimizing (UMO) architecture to resolve these problems. The UMO architecture is a control architecture for the memory subsystem that takes a unified approach to data prefetching, cache management, and memory access scheduling. On this architecture, we build a Map-based Unified Memory Subsystem Controller (MUMSC), which is composed of DRAM-Aware prefetching, Prefetch-Aware Cache Line Promotion, and lightweight memory controllers. MUMSC is implemented as a per-core resource that predicts future memory accesses from the per-core memory access history, realizing a scalable, high-performance memory subsystem at a reasonable hardware cost. We evaluate MUMSC using a multi-core simulator with multi-programmed SPEC CPU2006 workloads. The simulation results show that MUMSC improves system throughput by 11.5% over a combination of state-of-the-art enhancement techniques, without increasing the hardware cost or the design complexity of the shared resources.
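To illustrate the map-based prediction idea that MUMSC builds on, the following is a minimal conceptual sketch (not the paper's actual design): a per-core table keeps a per-page access map of which cache lines have been touched, and a simple stride-pattern check over that map produces prefetch candidates. The page geometry, the stride set, and all names here are illustrative assumptions.

```python
# Conceptual sketch of map-based prefetch prediction. Assumed geometry:
# 4 KiB pages of 64 B cache lines; the stride set and class interface
# are hypothetical, not taken from the paper.

LINE_SIZE = 64
LINES_PER_PAGE = 64  # 4 KiB page / 64 B lines


class AccessMapPrefetcher:
    def __init__(self):
        # page number -> set of line offsets accessed within that page
        self.maps = {}

    def access(self, addr):
        """Record an access and return predicted prefetch addresses."""
        line = addr // LINE_SIZE
        page, offset = divmod(line, LINES_PER_PAGE)
        amap = self.maps.setdefault(page, set())
        amap.add(offset)

        predictions = []
        # If the two preceding lines at a candidate stride were both
        # accessed, predict the next line at that same stride.
        for stride in (1, 2, -1):
            if {offset - stride, offset - 2 * stride} <= amap:
                target = offset + stride
                if 0 <= target < LINES_PER_PAGE and target not in amap:
                    predictions.append(
                        (page * LINES_PER_PAGE + target) * LINE_SIZE)
        return predictions
```

For example, after accesses to addresses 0 and 64, an access to 128 completes a unit-stride pattern within the page, so the sketch predicts address 192. A real controller would additionally filter these candidates against DRAM row-buffer state and cache contents, which is where the unified (DRAM-aware, prefetch-aware) control of the paper comes in.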