Tolerating latency through software-controlled prefetching in shared-memory multiprocessors
Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
An effective on-chip preloading scheme to reduce data access penalty
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Efficient algorithms for all-to-all communications in multi-port message-passing systems
SPAA '94 Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures
Increasing memory bandwidth for vector computations
Proceedings of the international conference on Programming languages and system architectures
Access ordering and effective memory bandwidth
Access ordering and effective memory bandwidth
Hitting the memory wall: implications of the obvious
ACM SIGARCH Computer Architecture News
Statistical scalability analysis of communication operations in distributed applications
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Building a high-performance collective communication library
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
BoomerAMG: a parallel algebraic multigrid solver and preconditioner
Applied Numerical Mathematics - Developments and trends in iterative methods for large systems of equations—in memoriam Rüdiger Weiss
Mambo: a full system simulator for the PowerPC architecture
ACM SIGMETRICS Performance Evaluation Review - Special issue on tools for computer architecture research
Direct Cache Access for High Bandwidth Network I/O
Proceedings of the 32nd annual international symposium on Computer Architecture
POWER5 System microarchitecture
IBM Journal of Research and Development - POWER5 and packaging
lmbench: portable tools for performance analysis
ATEC '96 Proceedings of the 1996 annual conference on USENIX Annual Technical Conference
IEEE Transactions on Computers
Reducing the Impact of the MemoryWall for I/O Using Cache Injection
HOTI '07 Proceedings of the 15th Annual IEEE Symposium on High-Performance Interconnects
Impact of Cache Coherence Protocols on the Processing of Network Traffic
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Characterizing application sensitivity to OS interference using kernel-level noise injection
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
On the Effects of Memory Latency and Bandwidth on Supercomputer Application Performance
IISWC '07 Proceedings of the 2007 IEEE 10th International Symposium on Workload Characterization
Instruction-level simulation of a cluster at scale
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
The PERCS High-Performance Interconnect
HOTI '10 Proceedings of the 2010 18th IEEE Symposium on High Performance Interconnects
Open MPI: a flexible high performance MPI
PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
Affinity-aware DMA buffer management for reducing off-chip memory access
Proceedings of the 27th Annual ACM Symposium on Applied Computing
Characterizing the impact of end-system affinities on the end-to-end performance of high-speed flows
NDM '13 Proceedings of the Third International Workshop on Network-Aware Data Management
Hi-index | 0.00 |
For two decades, the memory wall has affected many applications in their ability to benefit from improvements in processor speed. Cache injection addresses this disparity for I/O by writing data into a processor's cache directly from the I/O bus. This technique reduces data latency and, unlike data prefetching, improves memory bandwidth utilization. These improvements are significant for data-intensive applications whose performance is dominated by compulsory cache misses. We present an empirical evaluation of three injection policies and their effect on the performance of two parallel applications and several collective micro-benchmarks. We demonstrate that the effectiveness of cache injection on performance is a function of the communication characteristics of applications, the injection policy, the target cache, and the severity of the memory wall. For example, we show that injecting message payloads to the L3 cache can improve the performance of network-bandwidth limited applications. In addition, we show that cache injection improves the performance of several collective operations, but not all-to-all operations (implementation dependent). Our study shows negligible pollution to the target caches.