Exploiting the performance of today's processors requires intimate knowledge of the microarchitecture as well as an awareness of the ever-growing complexity in thread and cache topology. LIKWID is a set of command-line utilities that addresses four key problems: probing the thread and cache topology of a shared-memory node, enforcing thread-core affinity on a program, measuring performance counter metrics, and toggling hardware prefetchers. An API for using the performance counting features from user code is also included. We clearly state the differences from the widely used PAPI interface. To demonstrate the capabilities of the tool set, we show the influence of thread pinning on performance using the well-known OpenMP STREAM triad benchmark, and we use the affinity and hardware counter tools to study the performance of a stencil code specifically optimized to utilize shared caches on multicore chips.
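To make the two ideas behind the STREAM triad experiment concrete, the following is a minimal sketch and not LIKWID's own implementation or API. It assumes a Linux host, where the standard library exposes `os.sched_setaffinity`, and uses hypothetical helper names (`pin_to_core`, `triad`, `triad_bandwidth`) chosen only for illustration. It pins the calling thread to one core and then runs the triad update `a[i] = b[i] + scalar * c[i]`, counting three 8-byte array accesses per element as the traffic estimate.

```python
import os
import time
from array import array


def pin_to_core(core_id):
    """Try to pin the calling thread to a single core.

    Returns True on success and False if the platform lacks
    sched_setaffinity (assumption: Linux only) or the OS refuses.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, {core_id})
    except OSError:
        return False
    return True


def triad(a, b, c, scalar):
    """STREAM-style triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bandwidth(n, scalar=3.0):
    """Run one triad sweep over n doubles; return (a, bytes per second).

    The traffic estimate counts two loads and one store of 8 bytes each
    per element. Interpreter overhead dominates in pure Python, so the
    number only illustrates the bookkeeping, not real memory bandwidth.
    """
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    triad(a, b, c, scalar)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return a, (3 * 8 * n) / elapsed
```

A typical use would call `pin_to_core` with a core from `os.sched_getaffinity(0)` before `triad_bandwidth`, then repeat the run with a different core to see whether placement changes the result. A compiled, multithreaded kernel is needed to observe the pinning effects described in the abstract.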