Face detection is a key component in applications such as security surveillance and human-computer interaction systems, and real-time recognition is essential in many scenarios. The Viola-Jones algorithm is an attractive means of meeting the real-time requirement and has been widely implemented on custom hardware, FPGAs, and GPUs. We demonstrate a GPU implementation that achieves competitive performance at low development cost. Our solution handles the irregularity inherent in the algorithm with a novel dynamic warp scheduling approach that eliminates thread divergence. The scheme also employs a thread pool mechanism, which significantly reduces the cost of creating, switching, and terminating threads. Compared to static thread scheduling, dynamic warp scheduling reduces execution time by a factor of 3. To maximize detection throughput, we also run on multiple GPUs, reaching 95.6 FPS on 5 Fermi GPUs.
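To make the contrast between static and dynamic scheduling concrete, the following is a toy model in Python, not the GPU implementation described above. It assumes each detection window is rejected after some number of cascade stages (its "depth"), that a warp executes one stage per lockstep step, and that in the dynamic case an idle lane of a persistent warp simply pulls the next window from a shared queue. The function names, the warp size parameter, and the depth values are all hypothetical.

```python
from collections import deque


def static_steps(depths, warp_size=32):
    """Lockstep steps when each warp is handed a fixed batch of windows."""
    total = 0
    for start in range(0, len(depths), warp_size):
        batch = depths[start:start + warp_size]
        # The whole warp stays busy until its deepest window finishes.
        total += max(batch)
    return total


def dynamic_steps(depths, warp_size=32):
    """Lockstep steps when idle lanes of one persistent warp refill from a queue."""
    queue = deque(depths)
    lanes = [0] * warp_size  # remaining cascade stages per lane
    steps = 0
    while queue or any(lanes):
        # Idle lanes fetch the next pending window (thread-pool style).
        for i in range(warp_size):
            if lanes[i] == 0 and queue:
                lanes[i] = queue.popleft()
        if any(lanes):
            # One lockstep step: every busy lane evaluates one stage.
            lanes = [r - 1 if r > 0 else 0 for r in lanes]
            steps += 1
    return steps


if __name__ == "__main__":
    # Hypothetical workload: most windows are rejected at stage 1, a few run deep.
    depths = [1, 1, 1, 8, 1, 1, 1, 8]
    print("static :", static_steps(depths, warp_size=4))
    print("dynamic:", dynamic_steps(depths, warp_size=4))
```

In this model the static schedule pays for the deepest window in every batch, while the dynamic schedule keeps lanes occupied as long as work remains in the queue. The step counts it produces illustrate the idea only and say nothing about the measured factor-of-3 improvement, which depends on real cascade behavior and hardware costs that the model ignores.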