Continuous profiling: where have all the cycles gone?
ACM Transactions on Computer Systems (TOCS)
Brook for GPUs: stream computing on graphics hardware
ACM SIGGRAPH 2004 Papers
Automatic Panoramic Image Stitching using Invariant Features
International Journal of Computer Vision
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Predictive Runtime Code Scheduling for Heterogeneous Architectures
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
GPU acceleration of a production molecular docking code
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
SD-VBS: The San Diego Vision Benchmark Suite
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Performance characterization and optimization of mobile augmented reality on handheld platforms
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Proceedings of the 24th ACM International Conference on Supercomputing
ATI Stream Profiler: a tool to optimize an OpenCL kernel on ATI Radeon GPUs
ACM SIGGRAPH 2010 Posters
Accelerating S3D: a GPGPU case study
Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
SURF: speeded up robust features
ECCV'06 Proceedings of the 9th European conference on Computer Vision - Volume Part I
Enabling task-level scheduling on heterogeneous platforms
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Valar: a benchmark suite to study the dynamic behavior of heterogeneous systems
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Easy, fast, and energy-efficient object detection on heterogeneous on-chip architectures
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
Many developers have begun to realize that heterogeneous multi-core and many-core computer systems can provide significant performance opportunities to a range of applications. Typical applications possess multiple components that can be parallelized; developers need to be equipped with proper performance tools to analyze program flow and identify application bottlenecks. In this paper, we analyze and profile the components of the Speeded Up Robust Features (SURF) Computer Vision algorithm written in OpenCL. Our profiling framework is developed using built-in OpenCL API function calls, without the need for an external profiler. We show we can begin to identify performance bottlenecks and performance issues present in individual components on different hardware platforms. We demonstrate that by using run-time profiling using the OpenCL specification, we can provide an application developer with a fine-grained look at performance, and that this information can be used to tailor performance improvements for specific platforms.