Custom Data Layout for Memory Parallelism

Authors:
Byoungro So; Mary W. Hall; Heidi E. Ziegler
Venue:
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Year:
2004

Abstract

In this paper, we describe a generalized approach to deriving a custom data layout in multiple memory banks for array-based computations, to facilitate high-bandwidth parallel memory accesses in modern architectures where multiple memory banks can simultaneously feed one or more functional units. We do not use a fixed data layout, but rather select application-specific layouts according to access patterns in the code. A unique feature of this approach is its flexibility in the presence of code reordering transformations, such as the loop nest transformations commonly applied to array-based computations. We have implemented this algorithm in the DEFACTO system, a design environment for automatically mapping C programs to hardware implementations for FPGA-based systems. We present experimental results for five multimedia kernels that demonstrate the benefits of this approach. Our results show that custom data layout yields results as good as, or better than, naive or fixed cyclic layouts, and is significantly better for certain access patterns and in the presence of code reordering transformations. When used in conjunction with unrolling loops in a nest to expose instruction-level parallelism, we observe greater than a 75% reduction in the number of memory access cycles and speedups ranging from 3.96 to 46.7 for 8 memories, as compared to using a single memory with no unrolling.
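
As a rough illustration of why a fixed cyclic layout can lose to an access-pattern-specific one, the sketch below counts memory access cycles for a loop that reads `A[2*i]`, unrolled by 4, with 4 banks. It is not the DEFACTO algorithm: the cost model (each bank serves one access per cycle), the function names, and the stride-aware mapping are all assumptions made for this example.

```python
from collections import Counter

def cyclic_bank(index, num_banks):
    # Fixed cyclic layout: consecutive elements go to consecutive banks.
    return index % num_banks

def stride_aware_bank(index, num_banks, stride):
    # Hypothetical custom layout: elements that are `stride` apart
    # land in consecutive banks.
    return (index // stride) % num_banks

def access_cycles(indices, bank_of):
    # Assumed cost model: each bank serves one access per cycle, so a
    # group of parallel accesses takes as long as its busiest bank.
    per_bank = Counter(bank_of(i) for i in indices)
    return max(per_bank.values())

def total_cycles(n_iters, unroll, stride, bank_of):
    # Sum the cycles over all unrolled loop bodies reading A[stride * i].
    total = 0
    for base in range(0, n_iters, unroll):
        group = [stride * (base + u) for u in range(unroll)]
        total += access_cycles(group, bank_of)
    return total

NUM_BANKS = 4
STRIDE = 2
N_ITERS = 16
UNROLL = 4

single = total_cycles(N_ITERS, UNROLL, STRIDE, lambda i: 0)
cyclic = total_cycles(N_ITERS, UNROLL, STRIDE,
                      lambda i: cyclic_bank(i, NUM_BANKS))
custom = total_cycles(N_ITERS, UNROLL, STRIDE,
                      lambda i: stride_aware_bank(i, NUM_BANKS, STRIDE))

print(single, cyclic, custom)  # prints: 16 8 4
```

Under these assumptions the cyclic layout uses only two of the four banks for the stride-2 pattern, so each unrolled body needs two cycles, while the stride-aware layout spreads the four accesses over all four banks and needs one.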