Modeling communication in cache-coherent SMP systems: a case-study with Xeon Phi

Authors:
Sabela Ramos; Torsten Hoefler
Affiliations:
University of A Coruña, A Coruña, Spain; ETH Zurich, Zurich, Switzerland
Venue:
Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing
Year:
2013

Abstract

Most multi-core and some many-core processors implement cache coherency protocols that heavily complicate the design of optimal parallel algorithms. Communication happens implicitly through cache line transfers between cores, which makes performance properties hard to understand. We develop an intuitive performance model for cache-coherent architectures and demonstrate its use on Intel Xeon Phi, currently the most scalable cache-coherent many-core architecture. Using the model, we develop several optimal and optimized algorithms for complex parallel data exchanges. All algorithms developed with the model outperform the highly tuned vendor-specific Intel OpenMP and MPI libraries, by up to a factor of 4.3. The model can be simplified to trade accuracy against the complexity of algorithm design. We expect that it can serve as a vehicle for advanced algorithm design.