Reducing memory latency via non-blocking and prefetching caches
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Improving the cache locality of memory allocation
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Using lifetime predictors to improve memory allocation performance
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
The detection and elimination of useless misses in multiprocessors
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Cache write policies and performance
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Limitations of cache prefetching on a bus-based multiprocessor
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
A performance study of software and hardware data prefetching schemes
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Surpassing the TLB performance of superpages with less operating system support
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Memory system performance of programs with intensive heap allocation
ACM Transactions on Computer Systems (TOCS)
Hitting the memory wall: implications of the obvious
ACM SIGARCH Computer Architecture News
Memory bandwidth limitations of future microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Missing the memory wall: the case for processor/memory integration
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Using the SimOS machine simulator to study complex computer systems
ACM Transactions on Modeling and Computer Simulation (TOMACS)
Speculative execution via address prediction and data prefetching
ICS '97 Proceedings of the 11th international conference on Supercomputing
Memory systems and pipelined processors
Memory systems and pipelined processors
Segregating heap objects by reference behavior and lifetime
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Complete Computer System Simulation: The SimOS Approach
IEEE Parallel & Distributed Technology: Systems & Technology
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
This paper investigates a class of main memory accesses (invalid memory traffic) that can be eliminated altogether. Invalid memory traffic is real data traffic that transfers invalid data. By tracking the initialization of dynamic memory allocations, it is possible to identify store instructions that miss the cache and would fetch uninitialized heap data. The data transfers associated with these initialization misses can be avoided without losing correctness. The memory system property crucial for achieving good performance under heap allocation is cache installation - the ability to allocate and initialize a new object into the cache without a penalty. Tracking heap initialization at a cache block granularity enables cache installation mechanisms to provide zero-latency prefetching into the cache. We propose a hardware mechanism, the Allocation Range Cache, that can efficiently identify initializing store misses to the heap and trigger cache installations to avoid invalid memory traffic.Results: For a 2MB cache 23% of cache misses (35% of compulsory misses) to memory are initializing the heap in the SPEC CINT2000 benchmarks. By using a simple base-bounds range sweeping scheme to track the initialization of the 64 most recent dynamic memory allocations, nearly 100% of all initializing store misses can be identified and installed in cache without accessing memory. Smashing invalid memory traffic via cache installation at a cache block granularity removes 23% of all miss traffic and can provide up to 41% performance improvement.