The case for a single-chip multiprocessor
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Hardware/software partitioning and pipelining
DAC '97 Proceedings of the 34th annual Design Automation Conference
MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
RECOD: a retiming heuristic to optimize resource and memory utilization in HW/SW codesigns
Proceedings of the 6th international workshop on Hardware/software codesign
Clock rate versus IPC: the end of the road for conventional microarchitectures
Proceedings of the 27th annual international symposium on Computer architecture
Loop tiling for parallelism
A constructive algorithm for memory-aware task assignment and scheduling
Proceedings of the ninth international symposium on Hardware/software codesign
Dynamic management of scratch-pad memory space
Proceedings of the 38th annual Design Automation Conference
Scratchpad memory: design alternative for cache on-chip memory in embedded systems
Proceedings of the tenth international symposium on Hardware/software codesign
Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications
EDTC '97 Proceedings of the 1997 European conference on Design and Test
Data Reuse Analysis Technique for Software-Controlled Memory Hierarchies
Proceedings of the conference on Design, automation and test in Europe - Volume 1
Layer Assignment echniques for Low Energy in Multi-Layered Memory Organisations
DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
FORAY-GEN: Automatic Generation of Affine Functions for Memory Optimizations
Proceedings of the conference on Design, Automation and Test in Europe - Volume 2
Data partitioning for maximal scratchpad usage
ASP-DAC '03 Proceedings of the 2003 Asia and South Pacific Design Automation Conference
Multiprocessor system-on-chip data reuse analysis for exploring customized memory hierarchies
Proceedings of the 43rd annual Design Automation Conference
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
CODES+ISSS '06 Proceedings of the 4th international conference on Hardware/software codesign and system synthesis
Heterogeneous multiprocessor implementations for JPEG:: a case study
CODES+ISSS '06 Proceedings of the 4th international conference on Hardware/software codesign and system synthesis
Integrated scratchpad memory optimization and task scheduling for MPSoC architectures
CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Design methodology for pipelined heterogeneous multiprocessor system
Proceedings of the 44th annual Design Automation Conference
SoCDAL: System-on-chip design AcceLerator
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Compiler driven data layout optimization for regular/irregular array access patterns
Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
Pipelined data parallel task mapping/scheduling technique for MPSoC
Proceedings of the Conference on Design, Automation and Test in Europe
Macro pipelining based scheduling on high performance heterogeneousmultiprocessor systems
IEEE Transactions on Signal Processing
Quantum-inspired evolutionary algorithm for a class of combinatorial optimization
IEEE Transactions on Evolutionary Computation
Hi-index | 0.00 |
The increasing demand for low-power and high-performance multimedia embedded systems has motivated the need for effective solutions to satisfy application bandwidth and latency requirements under a tight power budget. As technology scales, it is imperative that applications are optimized to take full advantage of the underlying resources and meet both power and performance requirements. We propose MultiMaKe, an application mapping design flow capable of discovering and enabling parallelism opportunities via code transformations, efficiently distributing the computational load across resources, and minimizing unnecessary data transfers. Our approach decomposes the application's tasks into smaller units of computations called kernels, which are distributed and pipelined across the different processing resources. We exploit the ideas of inter-kernel data reuse to minimize unnecessary data transfers between kernels, early execution edges to drive performance, and kernel pipelining to increase system throughput. Our experimental results on JPEG and JPEG2000 show up to 97% off-chip memory access reduction, and up to 80% execution time reduction over standard mapping and task-level pipelining approaches.