Automatic translation of FORTRAN programs to vector form
ACM Transactions on Programming Languages and Systems (TOPLAS)
Communication optimization and code generation for distributed memory machines
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Automatic array alignment in data-parallel programs
POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Data and computation transformations for multiprocessors
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Data transformations for eliminating conflict misses
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Automatic data layout for distributed-memory machines
ACM Transactions on Programming Languages and Systems (TOPLAS)
A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts
IEEE Transactions on Parallel and Distributed Systems
Nonsingular Data Transformations: Definition, Validity, and Applications
International Journal of Parallel Programming
Exploiting superword level parallelism with multimedia instruction sets
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Optimizing compilers for modern architectures: a dependence-based approach
Optimizing compilers for modern architectures: a dependence-based approach
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
Achieving Scalable Locality with Time Skewing
International Journal of Parallel Programming
Increasing and Detecting Memory Address Congruence
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Vectorization for SIMD architectures with alignment constraints
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Automatic tiling of iterative stencil loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
Efficient SIMD Code Generation for Runtime Alignment and Length Conversion
Proceedings of the international symposium on Code generation and optimization
Impact of modern memory subsystems on cache optimizations for stencil computations
Proceedings of the 2005 workshop on Memory system performance
Multi-platform Auto-vectorization
Proceedings of the International Symposium on Code Generation and Optimization
Auto-vectorization of interleaved data for SIMD
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Implicit and explicit optimizations for stencil computations
Proceedings of the 2006 workshop on Memory system performance and correctness
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Effective automatic parallelization of stencil computations
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Outer-loop vectorization: revisited for short SIMD architectures
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
A SIMD optimization framework for retargetable compilers
ACM Transactions on Architecture and Code Optimization (TACO)
3D finite difference computation on GPUs using CUDA
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems
Proceedings of the 23rd international conference on Supercomputing
Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs
Proceedings of the 23rd international conference on Supercomputing
A Multilevel Parallelization Framework for High-Order Stencil Computations
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Optimized Stencil Computation Using In-Place Calculation on Modern Multicore Systems
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization
COMPSAC '09 Proceedings of the 2009 33rd Annual IEEE International Computer Software and Applications Conference - Volume 01
Mapping the FDTD Application to Many-Core Chip Architectures
ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing
Data transformations enabling loop vectorization on multithreaded data parallel architectures
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
New algorithms for SIMD alignment
CC'07 Proceedings of the 16th international conference on Compiler construction
A compiler framework for extracting superword level parallelism
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Tiling stencil computations to maximize parallelism
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
When polyhedral transformations meet SIMD code generation
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
A stencil compiler for short-vector SIMD architectures
Proceedings of the 27th international ACM conference on International conference on supercomputing
International Journal of High Performance Computing Applications
Semantics-preserving data layout transformations for improved vectorisation
Proceedings of the 2nd ACM SIGPLAN workshop on Functional high-performance computing
Compiling Scilab to high performance embedded multicore systems
Microprocessors & Microsystems
Hi-index | 0.00 |
Stencil computations are at the core of applications in many domains such as computational electromagnetics, image processing, and partial differential equation solvers used in a variety of scientific and engineering applications. Short-vector SIMD instruction sets such as SSE and VMX provide a promising and widely available avenue for enhancing performance on modern processors. However a fundamental memory stream alignment issue limits achieved performance with stencil computations on modern short SIMD architectures. In this paper, we propose a novel data layout transformation that avoids the stream alignment conflict, along with a static analysis technique for determining where this transformation is applicable. Significant performance increases are demonstrated for a variety of stencil codes on three modern SIMD-capable processors.