Iterative modulo scheduling: an algorithm for software pipelining loops
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Neural Network-Based Face Detection
IEEE Transactions on Pattern Analysis and Machine Intelligence
A bandwidth-efficient architecture for media processing
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Space-time scheduling of instruction-level parallelism on a raw machine
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Boosting beyond static scheduling in a superscalar processor
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
The FERET Evaluation Methodology for Face-Recognition Algorithms
IEEE Transactions on Pattern Analysis and Machine Intelligence
ILP-based Instruction Scheduling for IA-64
OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems
Computer and Robot Vision
CALiBeR: a software pipelining algorithm for clustered embedded VLIW processors
Proceedings of the 2001 IEEE/ACM international conference on Computer-aided design
Discriminant Analysis of Principal Components for Face Recognition
FG '98 Proceedings of the 3rd. International Conference on Face & Gesture Recognition
Instruction Scheduling for Clustered VLIW DSPs
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
Processor Acceleration Through Automated Instruction Set Customization
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
The perception processor
Merrimac: Supercomputing with Streams
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Exploiting pipelining to relax register-file port constraints of instruction-set extensions
Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
Modulo graph embedding: mapping applications onto coarse-grained reconfigurable architectures
CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
StreamRay: a stream filtering architecture for coherent ray tracing
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
The key to increasing performance without a commensurate increase in power consumption in modern processors lies in increasing both parallelism and core specialization. Core specialization has been employed in the embedded space and is likely to play an important role in future heterogeneous multi-core architectures as well. In this paper, the face recognition application domain is employed as a case study to showcase an architectural design methodology which generates a specialized core with high performance and very low powercharacteristics. Specifically, we create "ASIC-like" execution flows to sustain the high memory parallelism generated within the core. The price of this benefit is a significant increase in compilation complexity. The crux of the problem is the need to co-schedule the often conflicting constraints of data access, data movement, and computation. A modular compiler approach that employs integer linear programming (ILP) based "interconnect-aware" instruction and data scheduling techniques to solve this problem is then described. The resulting core running the compiled code delivers a 1.65x throughput improvement over a high performance processor (Pentium 4) while simultaneously achieving an 80x energy-delay improvement over an energy-efficient processor (XScale) and performs real-time face recognition at embedded power budgets.