Advances in silicon technology have enabled increasing support for hardware parallelism in embedded processors. Vector units, multiple processors or cores, multithreading, special-purpose accelerators such as DSPs or cryptographic engines, and combinations of these have appeared in a number of processors. They address the increasing performance requirements of modern embedded applications. The extent to which the available hardware parallelism can be exploited depends directly on the amount of parallelism inherent in the given application and on how well the granularity of the hardware parallelism matches that of the application. This paper discusses how loop-level parallelism in embedded applications can be exploited in hardware and software. Specifically, it evaluates the efficacy of automatic loop parallelization and the performance potential of three types of parallelism when executing loops: true thread-level parallelism (TLP), speculative thread-level parallelism, and vector parallelism. It also discusses the interaction between parallelization and vectorization. Applications from the industry-standard EEMBC® [1] 1.1 and EEMBC 2.0 suites and from the academic MiBench embedded benchmark suite are analyzed using the Intel® [2] C compiler. The results show the performance that can be achieved today on real hardware with a production compiler, provide upper bounds on the performance potential of the different types of thread-level parallelism, and point out a number of issues that need to be addressed to improve performance. These issues include the parallelization of libraries such as libc and the design of parallel algorithms that allow maximal exploitation of parallelism. The results also point to the need for new benchmark suites better suited to parallel compilation and execution.

[1] Other names and brands may be claimed as the property of others.
[2] Intel is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.
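
As a rough illustration of the kinds of loops the abstract refers to, the C sketch below contrasts two cases. It is not taken from the paper or its benchmarks; the function names `scale_add` and `prefix_sum` are hypothetical and chosen only for this example. The first loop has independent iterations, so a compiler could in principle split it across threads, vectorize it, or both. The second carries a dependence from one iteration to the next, so its iterations cannot simply run in parallel as written.

```c
#include <stddef.h>

/* Hypothetical example: every iteration touches only element i, and the
   restrict qualifiers promise that the arrays do not overlap. A compiler
   is therefore free to distribute iterations across threads and/or to
   process several elements per vector instruction. */
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = k * a[i] + b[i];
}

/* Hypothetical example: iteration i reads x[i - 1], which iteration i - 1
   has just written (a loop-carried dependence). Running iterations
   concurrently in this form would give wrong results; exploiting
   parallelism here requires a different algorithm, such as a parallel
   scan. */
void prefix_sum(int *x, size_t n)
{
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];
}
```

Whether a given compiler actually parallelizes or vectorizes the first loop depends on its options and cost model, which is one reason measured results can fall short of the upper bounds mentioned above.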