Most multi-core and some many-core processors implement cache coherency protocols that heavily complicate the design of optimal parallel algorithms. Communication is performed implicitly by cache line transfers between cores, which makes performance properties hard to understand. We develop an intuitive performance model for cache-coherent architectures and demonstrate its use with the currently most scalable cache-coherent many-core architecture, Intel Xeon Phi. Using our model, we develop several optimal and optimized algorithms for complex parallel data exchanges. All algorithms developed with the model outperform the highly tuned vendor-specific Intel OpenMP and MPI libraries, by up to a factor of 4.3. The model can be simplified to trade accuracy against the complexity of algorithm design. We expect that our model can serve as a vehicle for advanced algorithm design.