Application driven embedded system design: a face recognition case study

Authors:
Karthik Ramani;Al Davis
Affiliations:
University of Utah;University of Utah
Venue:
CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Year:
2007

Abstract

The key to increasing performance without a commensurate increase in power consumption in modern processors lies in increasing both parallelism and core specialization. Core specialization has been employed in the embedded space and is likely to play an important role in future heterogeneous multi-core architectures as well. In this paper, the face recognition application domain is employed as a case study to showcase an architectural design methodology that generates a specialized core with high performance and very low power characteristics. Specifically, we create "ASIC-like" execution flows to sustain the high memory parallelism generated within the core. The price of this benefit is a significant increase in compilation complexity. The crux of the problem is the need to co-schedule the often conflicting constraints of data access, data movement, and computation. We then describe a modular compiler approach that solves this problem with integer linear programming (ILP) based "interconnect-aware" instruction and data scheduling techniques. The resulting core, running the compiled code, delivers a 1.65x throughput improvement over a high-performance processor (Pentium 4) and an 80x energy-delay improvement over an energy-efficient processor (XScale), and it performs real-time face recognition at embedded power budgets.