With the proliferation of e-business, Java™ middleware and OLTP applications are gaining importance. As the gap between CPU and memory latencies continues to widen, the performance of these applications on multiprocessor systems will be increasingly limited by the memory system. This study characterizes the memory behavior of such applications using the SPECjAppServer2002 and TPC-C benchmarks running on a real multiprocessor system. Specifically, shared and private L3 caches with invalidation- and update-based coherence protocols are evaluated using the Programmable Hardware-Assisted Cache Emulator (PHA$E). We found that coherence misses grow with private L3 cache size, reaching more than 15% of all misses in both benchmarks. In addition, we observed a saturation point beyond which larger private caches yield no further reduction in miss ratio. The shared L3 cache design, by contrast, scales better because it incurs no coherence misses. Our limit study shows that the existing Write-Broadcast policy, which updates the copies held in other caches when a shared line is written, has the potential to reduce both the private-cache miss ratio and bus traffic. For example, at a 64MB cache size it reduces the miss ratio by 53% for SPECjAppServer2002 and by 44% for TPC-C, while lowering bus traffic by 18% and 11%, respectively. Overall, the policy can eliminate the saturation point and yields a private-cache miss ratio comparable to that of a shared cache.
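The distinction between invalidation-based and update-based (Write-Broadcast) coherence can be illustrated with a toy trace-driven model. This is a minimal sketch under several assumptions, and it is not how PHA$E works. Each CPU has one fully associative LRU private cache, and data payloads and dirty write-backs are not modeled. Each line fill and each invalidate or update broadcast counts as one bus transaction. A "coherence miss" is simplified to a miss on a line that this cache last lost to a remote invalidation. The names `PrivateCacheSystem`, `Stats`, and `run` are hypothetical.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Stats:
    accesses: int = 0
    misses: int = 0
    coherence_misses: int = 0
    bus_transactions: int = 0  # assumption: 1 per line fill, 1 per broadcast

    @property
    def miss_ratio(self) -> float:
        return self.misses / self.accesses if self.accesses else 0.0


class PrivateCacheSystem:
    """Toy model (hypothetical): one fully associative LRU cache per CPU."""

    def __init__(self, num_cpus: int, lines_per_cache: int, protocol: str):
        if protocol not in ("invalidate", "update"):
            raise ValueError("protocol must be 'invalidate' or 'update'")
        self.protocol = protocol
        self.capacity = lines_per_cache
        # OrderedDict as an LRU list: most recently used line at the end.
        self.caches = [OrderedDict() for _ in range(num_cpus)]
        # Lines each CPU lost to a remote invalidation (not to capacity eviction).
        self.invalidated = [set() for _ in range(num_cpus)]
        self.stats = Stats()

    def _fill(self, cpu: int, line: int) -> None:
        cache = self.caches[cpu]
        if len(cache) >= self.capacity:
            cache.popitem(last=False)  # evict the least recently used line
        cache[line] = True

    def access(self, cpu: int, op: str, line: int) -> None:
        if op not in ("R", "W"):
            raise ValueError("op must be 'R' or 'W'")
        self.stats.accesses += 1
        cache = self.caches[cpu]
        if line in cache:
            cache.move_to_end(line)  # hit: refresh LRU position
        else:
            self.stats.misses += 1
            self.stats.bus_transactions += 1  # fetch the line over the bus
            if line in self.invalidated[cpu]:
                self.stats.coherence_misses += 1
                self.invalidated[cpu].discard(line)
            self._fill(cpu, line)
        if op == "W":
            sharers = [c for c in range(len(self.caches))
                       if c != cpu and line in self.caches[c]]
            if sharers:
                self.stats.bus_transactions += 1  # one invalidate/update broadcast
                if self.protocol == "invalidate":
                    for c in sharers:
                        del self.caches[c][line]
                        self.invalidated[c].add(line)
                # "update": sharers keep a valid (refreshed) copy, so no later miss.


def run(protocol: str, trace, num_cpus: int = 2, lines_per_cache: int = 4) -> Stats:
    system = PrivateCacheSystem(num_cpus, lines_per_cache, protocol)
    for cpu, op, line in trace:
        system.access(cpu, op, line)
    return system.stats


if __name__ == "__main__":
    # Hypothetical producer/consumer trace: CPU 0 writes line 1, CPU 1 rereads it.
    trace = [(0, "R", 1), (1, "R", 1)] + [(0, "W", 1), (1, "R", 1)] * 3
    for protocol in ("invalidate", "update"):
        s = run(protocol, trace)
        print(protocol, s.misses, s.coherence_misses, s.bus_transactions)
    # prints:
    # invalidate 5 3 8
    # update 2 0 5
```

On this ping-pong pattern, the update policy removes every coherence miss and also uses fewer bus transactions. The tradeoff runs the other way when one CPU writes a shared line repeatedly and the other copies are never reread. Under invalidation, a single invalidation removes the other copy. Under update, every write is broadcast.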