Designing a Modern Memory Hierarchy with Hardware Prefetching
IEEE Transactions on Computers
Characterizing the d-TLB behavior of SPEC CPU2000 benchmarks
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Dynamic Cluster Resource Allocations for Jobs with Known and Unknown Memory Demands
IEEE Transactions on Parallel and Distributed Systems
Using the Compiler to Improve Cache Replacement Decisions
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Efficient use of memory bandwidth to improve network processor throughput
Proceedings of the 30th annual international symposium on Computer architecture
Guided region prefetching: a cooperative hardware/software approach
Proceedings of the 30th annual international symposium on Computer architecture
Design and Optimization of Large Size and Low Overhead Off-Chip Caches
IEEE Transactions on Computers
Memory Controller Optimizations for Web Servers
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Eliminating Conflict Misses Using Prime Number-Based Cache Indexing
IEEE Transactions on Computers
A study of performance impact of memory controller features in multi-processor server environment
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Register saturation in instruction level parallelism
International Journal of Parallel Programming
Exploiting locality to ameliorate packet queue contention and serialization
Proceedings of the 3rd conference on Computing frontiers
The bit-reversal SDRAM address mapping
SCOPES '05 Proceedings of the 2005 workshop on Software and compilers for embedded systems
Overlapping dependent loads with addressless preload
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Victim management in a cache hierarchy
IBM Journal of Research and Development - Advanced silicon technology
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Virtually Pipelined Network Memory
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Memory Prefetching Using Adaptive Stream Detection
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive insertion policies for high performance caching
Proceedings of the 34th annual international symposium on Computer architecture
Near-Memory Caching for Improved Energy Consumption
IEEE Transactions on Computers
Data access history cache and associated data prefetching mechanisms
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Optimizing thread throughput for multithreaded workloads on memory constrained CMPs
Proceedings of the 5th conference on Computing frontiers
Designing packet buffers for router linecards
IEEE/ACM Transactions on Networking (TON)
Guided Prefetching Based on Runtime Access Patterns
ICCS '08 Proceedings of the 8th international conference on Computational Science, Part III
DRAM is plenty fast for wirespeed statistics counting
ACM SIGMETRICS Performance Evaluation Review
Prefetch-Aware DRAM Controllers
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Mini-rank: Adaptive DRAM architecture for improving memory power efficiency
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices
Proceedings of the 36th annual international symposium on Computer architecture
Proceedings of the 36th annual international symposium on Computer architecture
Comprehensive cache performance tuning with a toolset
Future Generation Computer Systems
High-bandwidth network memory system through virtual pipelines
IEEE/ACM Transactions on Networking (TON)
Multiprocessor System-on-Chip designs with active memory processors for higher memory efficiency
Proceedings of the 46th Annual Design Automation Conference
A concurrent dynamic analysis framework for multicore hardware
Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
COMPASS: a programmable data prefetcher using idle GPU shaders
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Dynamic cluster resource allocations for jobs with known memory demands
Proceedings of the International Conference and Workshop on Emerging Trends in Technology
Early control of register pressure for software pipelined loops
CC'03 Proceedings of the 12th international conference on Compiler construction
Dynamic voltage and frequency scaling for scientific applications
LCPC'01 Proceedings of the 14th international conference on Languages and compilers for parallel computing
The virtual write queue: coordinating DRAM and last-level cache policies
Proceedings of the 37th annual international symposium on Computer architecture
Design and analysis of a robust pipelined memory system
INFOCOM'10 Proceedings of the 29th conference on Information communications
Software-hardware cooperative DRAM bank partitioning for chip multiprocessors
NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Design and performance analysis of a DRAM-based statistics counter array architecture
Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems
Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
ARCS'11 Proceedings of the 24th international conference on Architecture of computing systems
Cache-conscious data placement in an in-memory key-value store
Proceedings of the 15th Symposium on International Database Engineering & Applications
Processor directed dynamic page policy
ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
PACMan: prefetch-aware cache management for high performance caching
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Reducing off-chip memory traffic by selective cache management scheme in GPGPUs
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
SAD prefetching for MPEG4 using flux caches
SAMOS'06 Proceedings of the 6th international conference on Embedded Computer Systems: architectures, Modeling, and Simulation
DRAM-based statistics counter array architecture with performance guarantee
IEEE/ACM Transactions on Networking (TON)
Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation
Proceedings of the 40th Annual International Symposium on Computer Architecture
Return data interleaving for multi-channel embedded CMPs systems
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Meeting midway: improving CMP performance with memory-side prefetching
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Hi-index | 0.01 |
Abstract: In this paper,we address the severe performance gap caused by high processor clock rates and slow DRAM accesses.We show that even with an aggressive,next-generation memory system using four Direct Rambus channels and an integrated one-megabyte level-two cache,a processor still spends over half of its time stalling for L2 misses.Large cache blocks can improve performance,but only when coupled with wide memory channels.DRAM address mappings also affect performance significantly.We evaluate an aggressive prefetch unit integrated with the L2 cache and memory controllers.By issuing prefetches only when the Rambus channels are idle,prioritizing them to maximize DRAM row buffer hits,and giving them low replacement priority,we achieve a 43% speedup across 10 of the 26 SPEC2000 benchmarks,without degrading performance on the others.With eight Rambus channels,these ten benchmarks improve to within 10% of the performance of a perfect L2 cache.