Numerical recipes in C (2nd ed.): the art of scientific computing
Numerical recipes in C (2nd ed.): the art of scientific computing
Optimizing compilers for modern architectures: a dependence-based approach
Optimizing compilers for modern architectures: a dependence-based approach
Mulitdimensional Streams Rooted in Dataflow
PACT '93 Proceedings of the IFIP WG10.3. Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism
Improving Cache Behavior of Dynamically Allocated Data Structures
PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Iterative optimization in the polyhedral model: part ii, multidimensional time
Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Parallel Computing Experiences with CUDA
IEEE Micro
Direct N-body Kernels for Multicore Platforms
ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Proceedings of the 37th annual international symposium on Computer architecture
Can traditional programming bridge the Ninja performance gap for parallel computing applications?
Proceedings of the 39th Annual International Symposium on Computer Architecture
Boost.SIMD: generic programming for portable SIMDization
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Hi-index | 0.00 |
This paper presents a review of algorithmic transforms called High Level Transforms for IBM, Intel and ARM SIMD multicore processors to accelerate the implementation of low level image processing algorithms. We show that these optimizations provide a significant acceleration. A first evaluation of 512-bit SIMD Xeon- Phi is also presented. We focus on the point that the combination of optimizations leading to the best execution time cannot be predicted, and thus, systematic benchmarking is mandatory. Once the best configuration is found for each architecture, a comparison of these performances is presented. The Harris points detection operator is selected as being representative of low level image processing and computer vision algorithms. Being composed of five convolutions, it is more complex than a simple filter and enables more opportunities to combine optimizations. The presented work can scale across a wide range of codes using 2D stencils and convolutions.