Computer Science - Research and Development
In this paper, we present three performance and energy case studies of benchmark applications in the OmpSs task-based programming environment. Different parallel and vectorized implementations are evaluated on an Intel® Core™ i7-2600 quad-core processor. Using FLOPS/W values derived from the chip's model-specific registers (MSRs), we find that AVX code is, in general, clearly the most energy-efficient. The peak on-chip rates are 0.89 GFLOPS/W for Black-Scholes (BS), 1.38 GFLOPS/W for FFTW and 1.97 GFLOPS/W for Matrix Multiply (MM). The experiments cover varying degrees of thread parallelism and different OmpSs task-pool scheduling policies. We find that, for small and medium-sized problems, maximum energy efficiency is obtained by limiting the number of parallel threads. Compared with non-vectorized code, the AVX variants improve on-chip energy efficiency by about 6–7× for BS and 3–5× for FFTW, depending on the problem size and degree of multithreading.
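The abstract does not spell out how FLOPS/W is obtained from the MSR readings. The sketch below shows one plausible way under stated assumptions. The energy is taken to come from a RAPL-style package energy counter: a 32-bit value that wraps around and counts in units of 1/2^esu joules, where `esu` is the energy-status-units field. This counter model is an assumption, not something the paper states. The helper names are hypothetical, and reading the MSRs themselves is left out. The key point is that FLOPS/W = (FLOP/s) / (J/s) = FLOP/J, so runtime cancels and only the FLOP count and the energy consumed are needed.

```python
"""Sketch: on-chip GFLOPS/W from a package energy counter.

Assumptions (not from the paper): the counter behaves like a RAPL-style
package energy counter (32 bits wide, wraps around) whose unit is
1 / 2**esu joules. All function names here are hypothetical.
"""

COUNTER_BITS = 32  # assumed width of the wrapping energy counter


def energy_joules(start_raw: int, end_raw: int, esu: int) -> float:
    """Energy between two raw counter readings; tolerates at most one wraparound."""
    delta = (end_raw - start_raw) % (1 << COUNTER_BITS)
    return delta / (1 << esu)


def gflops_per_watt(flop_count: float, joules: float) -> float:
    """FLOPS/W equals FLOP/J because the runtime cancels; scale to GFLOP/J."""
    return flop_count / joules / 1e9


def matmul_flops(n: int) -> int:
    """Conventional count for an n x n dense matrix multiply: n^3 multiplies + n^3 adds."""
    return 2 * n ** 3


if __name__ == "__main__":
    # Made-up readings for illustration only (not measurements from the paper).
    # The counter wraps between the two samples.
    start, end, esu = 2**32 - 327680, 327680, 16
    joules = energy_joules(start, end, esu)
    print(joules, gflops_per_watt(matmul_flops(2048), joules))
```

For a fixed problem, the AVX and non-vectorized variants perform the same nominal FLOP count. A ratio of their GFLOPS/W values therefore equals the inverse ratio of the energy each consumes, which is how an improvement factor such as 6–7× can be read.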