Challenges in exploitation of loop parallelism in embedded applications

Authors:
Arun Kejariwal;Alexander V. Veidenbaum;Alexandru Nicolau;Milind Girkar;Xinmin Tian;Hideki Saito
Affiliations:
University of California at Irvine, Irvine, CA, USA;University of California at Irvine, Irvine, CA, USA;University of California at Irvine, Irvine, CA, USA;Intel Corporation, Santa Clara, CA, USA;Intel Corporation, Santa Clara, CA, USA;Intel Corporation, Santa Clara, CA, USA
Venue:
CODES+ISSS '06 Proceedings of the 4th international conference on Hardware/software codesign and system synthesis
Year:
2006

Abstract

Embedded processors increasingly exploit hardware parallelism. Vector units, multiple processors or cores, hyper-threading, special-purpose accelerators such as DSPs or cryptographic engines, or a combination of these have appeared in a number of processors to meet the growing performance requirements of modern embedded applications. How well applications can exploit this hardware parallelism depends directly on the amount of parallelism inherent in the target application. In this paper, we evaluate the performance potential of three types of parallelism when executing loops: true thread-level parallelism, speculative thread-level parallelism, and vector parallelism. Applications from the industry-standard EEMBC 1.1, EEMBC 2.0, and MiBench embedded benchmark suites are analyzed using the Intel C compiler. The results show what can be achieved today, provide upper bounds on the performance potential of different types of thread parallelism, and point out a number of issues that must be addressed to improve performance. These issues include the parallelization of libraries such as libc and the design of parallel algorithms that allow maximal exploitation of parallelism. The results also point to the need for new benchmark suites better suited to parallel compilation and execution.