A software-based dynamic-warp scheduling approach for load-balancing the Viola-Jones face detection algorithm on GPUs

Authors:
Tan Nguyen; Daniel Hefenbrock; Jason Oberg; Ryan Kastner; Scott Baden
Venue:
Journal of Parallel and Distributed Computing
Year:
2013

Abstract

Face detection is a key component in applications such as security surveillance and human-computer interaction systems, and real-time recognition is essential in many scenarios. The Viola-Jones algorithm is an attractive means of meeting the real-time requirement, and has been widely implemented on custom hardware, FPGAs, and GPUs. We demonstrate a GPU implementation that achieves competitive performance at low development cost. Our solution handles the irregularity inherent in the algorithm with a novel dynamic warp scheduling approach that eliminates thread divergence. This new scheme also employs a thread pool mechanism, which significantly reduces the cost of creating, switching, and terminating threads. Compared to static thread scheduling, our dynamic warp scheduling approach reduces the execution time by a factor of 3. To maximize detection throughput, we also run on multiple GPUs, realizing 95.6 frames per second (FPS) on 5 Fermi GPUs.
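
The irregularity mentioned in the abstract comes from the cascade structure of Viola-Jones: most candidate windows are rejected after a stage or two, while a few run through many stages. The sketch below is a toy cost model, not the authors' implementation. It assumes a warp of 32 lock-step threads and counts "warp-steps" (one warp evaluating one cascade stage). The function names, the `depths` input (stages each window evaluates before it terminates), and the idealized repacking of surviving windows into full warps are all assumptions made for illustration; the model ignores the cost of regrouping and any memory effects.

```python
WARP_SIZE = 32  # assumed number of threads that execute in lock-step


def static_warp_steps(depths, warp_size=WARP_SIZE):
    """Warp-steps when each window stays bound to a fixed lane.

    A warp remains busy until its deepest window finishes, so its cost
    is the maximum depth among its lanes; finished lanes sit idle.
    """
    steps = 0
    for start in range(0, len(depths), warp_size):
        steps += max(depths[start:start + warp_size])
    return steps


def dynamic_warp_steps(depths, warp_size=WARP_SIZE):
    """Warp-steps when surviving windows are repacked after every stage.

    Idealized stand-in for dynamic scheduling: at each cascade stage only
    as many warps run as are needed to hold the windows still alive.
    """
    steps = 0
    stage = 0
    while True:
        alive = sum(1 for d in depths if d > stage)
        if alive == 0:
            return steps
        steps += -(-alive // warp_size)  # ceiling division
        stage += 1
```

For example, with two warps that each hold 31 windows rejected after one stage and one window that runs 20 stages, `static_warp_steps(([1] * 31 + [20]) * 2)` gives 40 warp-steps, while `dynamic_warp_steps` on the same input gives 21. The gap in this toy model says nothing about the measured factor-of-3 speedup reported above; it only illustrates why regrouping work helps when per-window depth varies widely.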