Unifying data and control transformations for distributed shared-memory machines
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
The memory wall and the CMOS end-point
ACM SIGARCH Computer Architecture News
Automatic data layout for high performance Fortran
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Fast and extensive system-level memory exploration for ATM applications
ISSS '97 Proceedings of the 10th international symposium on System synthesis
Maps: a compiler-managed memory system for raw machines
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Smart Memories: a modular reconfigurable architecture
Proceedings of the 27th annual international symposium on Computer architecture
Access pattern based local memory customization for low power embedded systems
Proceedings of the conference on Design, automation and test in Europe
The hardness of cache conscious data placement
POPL '02 Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
A compiler approach to fast hardware design space exploration in FPGA-based systems
PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
A type theory for memory allocation and data layout
POPL '03 Proceedings of the 30th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
IEEE Transactions on Parallel and Distributed Systems
Latin Squares for Parallel Array Access
IEEE Transactions on Parallel and Distributed Systems
Memory Access Optimization and RAM Inference for Pipeline Vectorization
FPL '99 Proceedings of the 9th International Workshop on Field-Programmable Logic and Applications
Optimizations to prevent cache penalties for the Intel® Itanium® 2 Processor
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
NAPA C: Compiling for a Hybrid RISC/FPGA Architecture
FCCM '98 Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines
Parallelizing Applications into Silicon
FCCM '99 Proceedings of the Seventh Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Automatic Allocation of Arrays to Memories in FPGA Processors with Multiple Memory Banks
FCCM '99 Proceedings of the Seventh Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Access ordering and memory-conscious cache utilization
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
A Buffer Merging Technique for Reducing Memory Requirements of Synchronous Dataflow Specifications
Proceedings of the 12th international symposium on System synthesis
Automatic computation and data decomposition for multiprocessors
Automatic computation and data decomposition for multiprocessors
Evaluating heuristics in automatically mapping multi-loop applications to FPGAs
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Optimizing Address Code Generation for Array-Intensive DSP Applications
Proceedings of the international symposium on Code generation and optimization
An ILP based approach to address code generation for digital signal processors
GLSVLSI '06 Proceedings of the 16th ACM Great Lakes symposium on VLSI
A practical approach of memory access parallelization to exploit multiple off-chip DDR memories
Proceedings of the 45th annual Design Automation Conference
A compiler approach to managing storage and memory bandwidth in configurable architectures
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Compiling for reconfigurable computing: A survey
ACM Computing Surveys (CSUR)
A design space exploration algorithm in compiling window operation onto reconfigurable hardware
International Journal of Computers and Applications
Optimized generation of memory structure in compiling window operations onto reconfigurable hardware
ARC'07 Proceedings of the 3rd international conference on Reconfigurable computing: architectures, tools and applications
An Efficient Memory Organization for High-ILP Inner Modem Baseband SDR Processors
Journal of Signal Processing Systems
Code transformations for embedded reconfigurable computing architectures
GTTSE'09 Proceedings of the 3rd international summer school conference on Generative and transformational techniques in software engineering III
Array replication to increase parallelism in applications mapped to configurable architectures
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Optimizing data locality using array tiling
Proceedings of the International Conference on Computer-Aided Design
A data layout optimization framework for NUCA-based multicores
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Improving last level cache locality by integrating loop and data transformations
Proceedings of the International Conference on Computer-Aided Design
Reshaping cache misses to improve row-buffer locality in multicore systems
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
ACM Transactions on Design Automation of Electronic Systems (TODAES)
A scalable and near-optimal representation of access schemes for memory management
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
In this paper, we describe a generalized approach toderiving a custom data layout in multiple memory banksfor array-based computations, to facilitate high-bandwidthparallel memory accesses in modern architectures wheremultiple memory banks can simultaneously feed one ormore functional units. We do not use a fixed data layout,but rather select application-specific layouts according toaccess patterns in the code. A unique feature of this approachis its flexibility in the presence of code reorderingtransformations, such as the loop nest transformations commonlyapplied to array-based computations. We have implementedthis algorithm in the DEFACTO system, a designenvironment for automatically mapping C programsto hardware implementations for FPGA-based systems. Wepresent experimental results for five multimedia kernels thatdemonstrate the benefits of this approach. Our results showthat custom data layout yields results as good as, or betterthan, naive or fixed cyclic layouts, and is significantly betterfor certain access patterns and in the presence of codereordering transformations. When used in conjunction withunrolling loops in a nest to expose instruction-level parallelism,we observe greater than a 75% reduction in the numberof memory access cycles and speedups ranging from3.96 to 46.7 for 8 memories, as compared to using a singlememory with no unrolling.