Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors

Authors:
Yi Wang; Linfeng Pan; Zili Shao; Yong Guan; Minyi Guo
Affiliations:
Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong; Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China 200240; Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong; College of Information Engineering, Capital Normal University, Beijing, China; Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China 200240
Venue:
Journal of Signal Processing Systems
Year:
2014

Abstract

Multimedia SIMD extensions are commonly used today to speed up media processing. A major issue when vectorizing for SIMD architectures is handling memory alignment. Prior studies focused either on vectorizing loops in which all memory references are properly aligned, or on introducing extra operations to deal with misaligned memory references. Multi-core SIMD architectures, in addition, require coarse-grain parallelism. It is therefore important to study how to parallelize and vectorize loop nests with awareness of data misalignment. This paper presents a loop transformation scheme that maximizes the parallelism of outermost loops while reducing the misaligned memory references in innermost loops. The basic idea of our technique is to align each level of loops in the nest, subject to the constraints imposed by dependence relations. To reduce data misalignments, we establish a mathematical model based on the concept of offset-collection and propose an effective heuristic algorithm. For coarser-grain parallelism, we propose a set of rules for analyzing the outermost loop. When transformations are applied, the inner loops are involved to maximize parallelism. To avoid introducing more data misalignments, the involved innermost loop is handled separately from the other levels of loops. Experimental results show that misaligned memory references can be reduced by 7% to 37% (18.4% on average). Simulations on CELL show that a 1.1x speedup can be reached by reducing misaligned data, while a 6.14x speedup can be achieved by enhancing parallelism for multi-core.
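
To make the notion of a misaligned memory reference concrete, the following is a minimal sketch and not the paper's offset-collection model or heuristic. It assumes a 16-byte vector register, 4-byte elements, and arrays whose base addresses are vector-aligned; the names `misalignment`, `count_misaligned`, and `best_shift` are hypothetical and introduced only for illustration. It counts how many references of the form `a[i + k]` in an innermost loop fall off a vector boundary, and brute-forces the starting iteration offset that leaves the fewest misaligned.

```python
VECTOR_BYTES = 16  # assumed SIMD register width (128 bits)
ELEM_BYTES = 4     # assumed element size (32-bit values)


def misalignment(offset_elems):
    """Byte distance of a[i + offset_elems] from the preceding vector
    boundary, assuming an aligned array base and i a multiple of the
    number of lanes."""
    return (offset_elems * ELEM_BYTES) % VECTOR_BYTES


def count_misaligned(offsets, shift=0):
    """Number of references that are misaligned when the vectorized
    loop starts 'shift' iterations later."""
    return sum(1 for k in offsets if misalignment(k + shift) != 0)


def best_shift(offsets):
    """Exhaustively try every start offset within one vector and return
    the one that leaves the fewest misaligned references."""
    lanes = VECTOR_BYTES // ELEM_BYTES
    return min(range(lanes), key=lambda s: count_misaligned(offsets, s))


# Hypothetical loop body reading b[i+1], c[i+1], d[i+2], e[i+5].
offsets = [1, 1, 2, 5]
shift = best_shift(offsets)
print(count_misaligned(offsets), shift, count_misaligned(offsets, shift))
```

In this toy case all four references are misaligned at the original loop start, while starting three iterations later leaves only one. The sketch ignores dependence constraints and multi-level loop nests, both of which the scheme described in the abstract takes into account.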