Cache injection for parallel applications

Authors:
Edgar A. León; Rolf Riesen; Kurt B. Ferreira; Arthur B. Maccabe
Affiliations:
IBM Research, Austin, TX, USA; IBM Research, Dublin, Ireland; Sandia National Laboratories, Albuquerque, NM, USA; Oak Ridge National Laboratory, Oak Ridge, TN, USA
Venue:
Proceedings of the 20th International Symposium on High Performance Distributed Computing
Year:
2011

Abstract

For two decades, the memory wall has limited the ability of many applications to benefit from improvements in processor speed. Cache injection addresses this disparity for I/O by writing data into a processor's cache directly from the I/O bus. This technique reduces data latency and, unlike data prefetching, improves memory bandwidth utilization. These improvements are significant for data-intensive applications whose performance is dominated by compulsory cache misses. We present an empirical evaluation of three injection policies and their effect on the performance of two parallel applications and several collective micro-benchmarks. We demonstrate that the effect of cache injection on performance depends on the communication characteristics of the application, the injection policy, the target cache, and the severity of the memory wall. For example, we show that injecting message payloads into the L3 cache can improve the performance of network-bandwidth-limited applications. In addition, we show that cache injection improves the performance of several collective operations, but not all-to-all operations (implementation dependent). Our study shows negligible pollution of the target caches.