Automatic translation of FORTRAN programs to vector form
ACM Transactions on Programming Languages and Systems (TOPLAS)
Compiler Optimizations for Enhancing Parallelism and Their Impact on Architecture Design
IEEE Transactions on Computers - Special issue on architectural support for programming languages and operating systems
Supercompilers for parallel and vector computers
Supercompilers for parallel and vector computers
An integrated compile-time/run-time software distributed shared memory system
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Communications of the ACM
Achieving Full Parallelism Using Multidimensional Retiming
IEEE Transactions on Parallel and Distributed Systems
Advanced compiler design and implementation
Advanced compiler design and implementation
Exploiting superword level parallelism with multimedia instruction sets
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
High performance Fortran compilation techniques for parallelizing scientific codes
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Complete Guide to Mmx Technology
Complete Guide to Mmx Technology
Automatic intra-register vectorization for the Intel architecture
International Journal of Parallel Programming
A Loop Transformation Theory and an Algorithm to Maximize Parallelism
IEEE Transactions on Parallel and Distributed Systems
Increasing and Detecting Memory Address Congruence
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Vectorization for SIMD architectures with alignment constraints
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Optimizing OpenMP programs on software distributed shared memory systems
International Journal of Parallel Programming - Special issue: OpenMP: Experiences and implementations
Efficient SIMD Code Generation for Runtime Alignment and Length Conversion
Proceedings of the international symposium on Code generation and optimization
Optimizing Compiler for the CELL Processor
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Communication Optimizations for Fine-Grained UPC Applications
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
IEEE Micro
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
Cost minimization while satisfying hard/soft timing constraints for heterogeneous embedded systems
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Iterational retiming with partitioning: Loop scheduling with complete memory latency hiding
ACM Transactions on Embedded Computing Systems (TECS)
Compiler-assisted leakage-aware loop scheduling for embedded VLIW DSP processors
Journal of Systems and Software
RTAS '10 Proceedings of the 2010 16th IEEE Real-Time and Embedded Technology and Applications Symposium
ACM Transactions on Design Automation of Electronic Systems (TODAES)
RTSS '10 Proceedings of the 2010 31st IEEE Real-Time Systems Symposium
Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems
Optimally Maximizing Iteration-Level Loop Parallelism
IEEE Transactions on Parallel and Distributed Systems
Optimizing Data Allocation for Loops on Embedded Systems with Scratch-Pad Memory
RTCSA '12 Proceedings of the 2012 IEEE International Conference on Embedded and Real-Time Computing Systems and Applications
Optimally Removing Intercore Communication Overhead for Streaming Applications on MPSoCs
IEEE Transactions on Computers
Hi-index | 0.00 |
Multimedia SIMD extensions are commonly employed today to speed up media processing. When performing vectorization for SIMD architectures, one of the major issues is to handle the problem of memory alignment. Prior study focused on either vectorizing loops with all memory references being properly aligned, or introducing extra operations to deal with the misaligned memory references. On the other hand, multi-core SIMD architectures require coarse-grain parallelism. Therefore, it is an important problem to study how to parallelize and vectorize loop nests with the awareness of data misalignments. This paper presents a loop transformation scheme that maximizes the parallelism of outermost loops, while the misaligned memory references in innermost loops are reduced. The basic idea of our technique is to align each level of loops in the nest, considering the constraint of dependence relations. To reduce the data misalignments, we establish a mathematical model with a concept of offset-collection and propose an effective heuristic algorithm. For coarser-grain parallelism, we propose some rules to analyze the outermost loop. When transformations are applied, the inner loops are involved to maximize the parallelism. To avoid introducing more data misalignments, the involved innermost loop is handled from other levels of loops. Experimental results show that 7 % to 37 % (on average 18.4 %) misaligned memory references can be reduced. The simulations on CELL show that 1.1x speedup can be reached by reducing the misaligned data, while 6.14x speedup can be achieved by enhancing the parallelism for multi-core.