Accelerating data movement on future chip multi-processors

Authors:
Junli Gu; Rakesh Kumar; Steven S. Lumetta; Yihe Sun
Affiliations:
Tsinghua University, Beijing, China and University of Illinois at Urbana-Champaign, Illinois; University of Illinois at Urbana-Champaign, Illinois; University of Illinois at Urbana-Champaign, Illinois; Tsinghua University, Beijing, China
Venue:
Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies
Year:
2010

Abstract

Moving data between cores on hardware-coherent architectures incurs memory latency, cache misses, and coherence traffic, all of which are obstacles to achieving high performance. In this paper, we evaluate the potential for hardware optimization of message data transfer on chip multiprocessors using a combination of NAS parallel MPI benchmarks, Intel IMB MPI benchmarks, and a few microbenchmarks, run on a full-system simulator based on Simics and FeS2. We show that passive hardware driven by the cores can reduce cache traffic but provides only limited performance gains. We propose a data movement manager (DMM) that uses the on-chip coherence protocols to implement zero-copy message passing between separate address spaces and to remove synchronization and copy overheads from the processors. We also discuss methods for managing data placement in caches to reduce latency. Our results indicate that such a design holds substantial promise for both reducing cache traffic and improving performance.