Generation of heterogeneous distributed architectures for memory-intensive applications through high-level synthesis

Authors:
Chao Huang;Srivaths Ravi;Anand Raghunathan;Niraj K. Jha
Affiliations:
Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Blacksburg, VA;DSPS Design Team, Texas Instruments, Bangalore, India;NEC Laboratories America, Princeton, NJ;Department of Electrical Engineering, Princeton University, Princeton, NJ
Venue:
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Year:
2007



Abstract

Memory-intensive applications present unique challenges to an application-specific integrated circuit (ASIC) designer in terms of the choice of memory organization, memory size requirements, bandwidth, access latencies, etc. The high potential of single-chip distributed logic-memory architectures in addressing many of these issues has been recognized in general-purpose computing and, more recently, in ASIC design. The high-level synthesis (HLS) techniques presented in this paper are motivated by the fact that many memory-intensive applications exhibit irregular array data access patterns. Synthesis should therefore be capable of determining a partitioned architecture in which array data and computations may have to be heterogeneously distributed to achieve the best performance speed-up. We use a combination of clustering and min-cut style partitioning techniques to yield distributed architectures, based on simulation profiling, while considering various factors including data access locality, balanced workloads, inter-partition communication, etc. Our experiments with several benchmark applications show that the proposed techniques yield two-way partitioned architectures that achieve up to 2.1× (average of 1.9×) performance speed-up over conventional HLS solutions, and up to 1.5× (average of 1.4×) performance speed-up over the best feasible homogeneous partitioning solution. At the same time, the reduction in the energy-delay product over conventional single-memory designs is up to 2.7× (average of 2.0×). A larger degree of partitioning makes further system performance improvement achievable, at the cost of chip area.
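
The idea of min-cut style two-way partitioning driven by profiled accesses can be illustrated with a minimal sketch. This is not the paper's algorithm: it is an exhaustive search over a toy graph, and the node names (`A`, `B`, `op1` to `op4`), the edge weights (standing in for profiled access or communication counts), and the function names are all hypothetical. It only shows the objective: place arrays and computations on two sides so that the traffic crossing the partition boundary is small while the two sides stay balanced.

```python
from itertools import combinations


def cut_weight(edges, part_a):
    """Sum the weights of edges with exactly one endpoint in part_a."""
    return sum(w for (u, v), w in edges.items() if (u in part_a) != (v in part_a))


def best_two_way_partition(nodes, edges, max_imbalance=1):
    """Exhaustively search bipartitions (toy sizes only) and keep the one with
    the lowest cut weight whose side sizes differ by at most max_imbalance."""
    nodes = sorted(nodes)
    anchor, rest = nodes[0], nodes[1:]  # fix one node to skip mirrored splits
    best = None
    for k in range(len(rest) + 1):
        for combo in combinations(rest, k):
            part_a = {anchor, *combo}
            part_b = set(nodes) - part_a
            if not part_b or abs(len(part_a) - len(part_b)) > max_imbalance:
                continue
            cost = cut_weight(edges, part_a)
            if best is None or cost < best[0]:
                best = (cost, part_a, part_b)
    return best


# Hypothetical profile: two arrays (A, B) and four computations (op1..op4).
# Each weight is an assumed count of accesses or transfers between two nodes.
profile = {
    ("A", "op1"): 90,
    ("A", "op2"): 80,
    ("B", "op3"): 70,
    ("B", "op4"): 85,
    ("op2", "op3"): 5,
    ("A", "op4"): 10,
}
all_nodes = {n for edge in profile for n in edge}
cost, side_a, side_b = best_two_way_partition(all_nodes, profile)
```

In this toy profile, `op4` mostly reads `B` but occasionally reads `A`, so any balanced split leaves some inter-partition traffic; the search keeps each array with the computations that access it most. A practical flow would replace the exhaustive search with a heuristic, since the number of bipartitions grows exponentially with the number of nodes.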