Platform 2012, a many-core computing accelerator for embedded SoCs: performance evaluation of visual analytics applications

Authors:
Diego Melpignano;Luca Benini;Eric Flamand;Bruno Jego;Thierry Lepley;Germain Haugou;Fabien Clermidy;Denis Dutoit
Affiliations:
STMicroelectronics - AST, Grenoble, France;STMicroelectronics - AST, Grenoble, France and University of Bologna--DEIS, Bologna, Italy;STMicroelectronics - AST, Grenoble, France;STMicroelectronics - AST, Grenoble, France;STMicroelectronics - AST, Grenoble, France;STMicroelectronics - AST, Grenoble, France;CEA-LETI, Grenoble, France;CEA-LETI, Grenoble, France
Venue:
Proceedings of the 49th Annual Design Automation Conference
Year:
2012

Abstract

P2012 is an area- and power-efficient many-core computing accelerator based on multiple globally asynchronous, locally synchronous (GALS) processor clusters. Each cluster features up to 16 processors with independent instruction streams that share a multi-banked L1 data memory with one-cycle access, a multi-channel DMA engine, and specialized hardware for synchronization and aggressive power management. P2012 is 3D-stacking ready and can be customized for extreme area and energy efficiency by adding domain-specific HW IPs to the cluster. The first P2012 SoC prototype, in 28 nm CMOS, will sample in Q3. It features four 16-processor clusters and a 1 MB L2 memory, and delivers 80 GOPS (with 32-bit single-precision floating-point support) in 18 mm² at a worst-case power consumption of 2 W. P2012 can run standard OpenCL™ as well as proprietary Native Programming Model SW components, which give the highest level of control over application-to-resource mapping. A dedicated version of the OpenCV vision library is provided in the P2012 SW Development Kit to enable visual analytics acceleration. This paper discusses preliminary performance measurements of common feature extraction and tracking algorithms, parallelized on P2012, versus sequential execution on ARM CPUs.
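
The headline figures in the abstract can be combined into a few derived efficiency metrics. The sketch below is only illustrative arithmetic on the quoted numbers (four clusters of 16 processors, 80 GOPS, 18 mm², 2 W worst case). The constant and function names are hypothetical and not part of any P2012 tool, and because the quoted power figure is a worst case, the GOPS/W value should be read as a rough lower bound at peak throughput rather than a measured result.

```python
# Figures quoted in the abstract for the first P2012 SoC prototype.
# All names below are illustrative assumptions, not P2012 SDK identifiers.
CLUSTERS = 4
PROCESSORS_PER_CLUSTER = 16
PEAK_GOPS = 80.0
AREA_MM2 = 18.0
WORST_CASE_POWER_W = 2.0


def derived_metrics():
    """Return efficiency metrics derived from the quoted peak figures."""
    total_processors = CLUSTERS * PROCESSORS_PER_CLUSTER
    return {
        "processors": total_processors,
        # Peak throughput divided by worst-case power.
        "gops_per_watt": PEAK_GOPS / WORST_CASE_POWER_W,
        # Peak throughput divided by the quoted silicon area.
        "gops_per_mm2": PEAK_GOPS / AREA_MM2,
        # Average peak throughput attributable to each processor.
        "gops_per_processor": PEAK_GOPS / total_processors,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the prototype has 64 processors, with peak figures of 40 GOPS/W, about 4.4 GOPS/mm², and 1.25 GOPS per processor.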