System-level timing analysis and optimizations for hardware compilation

  • Authors:
  • Seth Copen Goldstein; Girish Venkataramani

  • Affiliations:
  • Carnegie Mellon University; Carnegie Mellon University

  • Venue:
  • Ph.D. Dissertation, Carnegie Mellon University
  • Year:
  • 2007

Abstract

This dissertation presents a System-Level Timing Analysis (SLTA) methodology and a micro-architectural optimization framework for use within hardware compilation. As the preferred EDA abstraction layer is raised to the Electronic System Level (ESL), the focus shifts to describing systems using Transaction Level Modeling (TLM) [CG03, Pas02, Ede06], which is amenable to high-level synthesis. The proposed SLTA methodology and ESL optimization framework are designed to complement TLM-based synthesis flows by analyzing the sequential dependency behavior of system-level transactions. Using this knowledge, control-path-altering micro-architectural optimizations are applied iteratively on a well-defined hardware Intermediate Representation (IR). There are two overarching contributions in this dissertation. First, we describe an IR that is a valuable addition to the infrastructure of a hardware compiler. The IR captures data/control dependencies in the source program as well as resource dependencies of the underlying circuit architecture. It is an abstraction of transaction events in the TLM but is also linked to the RTL control-path signals that implement the TLM specification and communication protocols. By analyzing the properties of the IR, a set of timing entities is produced that characterizes system-level performance. The goal of these timing entities is to characterize the system's sequential execution attributes, conventionally captured by cycle time [Bur91, NK94, IP95, Das04] or initiation interval [RG81, Lam88], which specifies the time interval between successive iterations of hardware execution. Instead of representing system-level timing as a single number (cycle time), we propose using a set of fine-grained building blocks that describe various aspects of system-level timing.
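To make the IR's role concrete, here is a minimal sketch of an event-graph IR that records data, control, and resource dependencies between transaction events. The class and field names (`EventIR`, `Dep`, etc.) are illustrative assumptions, not the dissertation's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Dep:
    src: str      # producing event (e.g. an RTL handshake signal)
    dst: str      # consuming event
    kind: str     # "data", "control", or "resource"
    delay: float  # nominal latency of the dependency

@dataclass
class EventIR:
    events: set = field(default_factory=set)
    deps: list = field(default_factory=list)

    def add_dep(self, src, dst, kind, delay):
        # Record a dependency edge; the three kinds mirror the
        # data/control/resource dependencies the IR captures.
        assert kind in ("data", "control", "resource")
        self.events |= {src, dst}
        self.deps.append(Dep(src, dst, kind, delay))

    def deps_of_kind(self, kind):
        return [d for d in self.deps if d.kind == kind]
```

For example, `ir.add_dep("ld.ack", "mul.req", "data", 2.0)` records that the multiply transaction can start only after the load acknowledges, two time units later.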
The primary building block is slack, defined as the time difference between the firing of a given event and when the event is used in a transaction downstream. Using slack, we define the Global Critical Path (GCP) of the system as the longest path of zero-slack (or critical) events. The GCP, in essence, traces the hardware events that directly contribute to the system-wide cycle time. A third entity, global slack, a derivative of both slack and the GCP, specifies how early an event is produced before it is used in a GCP transaction downstream. All three entities are recorded as annotations on the proposed IR, which enables the hardware compiler to easily weigh the costs and benefits of local circuit transformations. Second, we describe an ESL optimization framework built on top of the proposed IR. The framework is designed to support optimizations that apply IR-to-IR transformations. We also describe a fast update function that re-computes the system-level timing entities when a given IR-to-IR transformation is applied. This enables the development of optimization algorithms and design-exploration tools that scan the design space by iteratively applying a series of quality-enhancing local transformations. The main benefit of this approach is that it separates high-level synthesis from high-level optimizations: the hardware compiler can evaluate different circuit architectures before committing to one that will be synthesized. Since the GCP, slack, and global slack are excellent indicators of system-level performance, they help focus the optimization and/or exploration effort on the circuit sub-systems that are most critical for performance (or on those that are most non-critical if the objective is power/area minimization).
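The slack and GCP definitions above can be sketched as a longest-path computation over a DAG of timed events. This is a simplified illustration, assuming a single source and sink event and acyclic dependencies; the function name and graph encoding are hypothetical, not the dissertation's implementation.

```python
from collections import defaultdict

def slack_and_gcp(edges, source, sink):
    """edges: list of (u, v, delay) event dependencies.
    Returns (slack per event, Global Critical Path)."""
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = {source, sink}
    for u, v, d in edges:
        succ[u].append((v, d))
        pred[v].append((u, d))
        nodes |= {u, v}
    # Topological order via DFS (graph assumed acyclic).
    order, seen = [], set()
    def dfs(n):
        seen.add(n)
        for m, _ in succ[n]:
            if m not in seen:
                dfs(m)
        order.append(n)
    dfs(source)
    order.reverse()
    # Earliest firing time of each event: longest path from the source.
    arrive = {n: 0 for n in nodes}
    for n in order:
        for m, d in succ[n]:
            arrive[m] = max(arrive[m], arrive[n] + d)
    # Latest firing time that does not stretch the cycle time.
    cycle_time = arrive[sink]
    require = {n: cycle_time for n in nodes}
    for n in reversed(order):
        for m, d in pred[n]:
            require[m] = min(require[m], require[n] - d)
    slack = {n: require[n] - arrive[n] for n in nodes}
    # GCP: walk zero-slack predecessors back from the sink.
    gcp, n = [sink], sink
    while n != source:
        n = next(u for u, d in pred[n]
                 if slack[u] == 0 and arrive[u] + d == arrive[n])
        gcp.append(n)
    gcp.reverse()
    return slack, gcp
```

An event with positive slack can fire later (or its producer can be slowed) without affecting the cycle time, which is exactly what makes slack useful for power/area trade-offs on non-critical sub-systems.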
A key ingredient that makes this framework practical and scalable is the ability to efficiently update the timing entities in response to structural changes introduced by optimizations. The proposed linear-time update forms the "glue" that allows the timing entities to be re-used without re-analyzing system-level performance from scratch. The dissertation makes the following claims: (1) slack and the GCP sufficiently and accurately model a fine-grained representation of cycle time; (2) computing the change in cycle time after applying a circuit transformation is linear in the size of the design; (3) several existing circuit optimizations can be re-formulated to use this methodology, resulting in efficient, quadratic-time, heuristic algorithms for hard optimization problems. We present a proof of concept by embedding the SLTA methodology and ESL optimization framework within CASH (Compiler for Application-Specific Hardware), a hardware compiler that synthesizes asynchronous circuit implementations from C programs. Three pipeline optimizations were applied in series to improve energy efficiency: slack matching, operation chaining, and hybrid latch synthesis. Experimental results from using the SLTA framework to optimize several media-processing kernels with these transformations show that the average energy-delay and energy-delay-area products improve by about 1.44x and 2x respectively, with peak improvements of 5.3x and 18.5x. Further, using the timing-update algorithm instead of a complete timing re-analysis between optimizations reduces the total optimization-loop runtime from several hours to a few seconds, with a quality degradation of less than 1% in terms of energy-delay, area, and performance.
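The incremental-update idea can be sketched as a worklist that re-relaxes earliest firing times only for events downstream of a changed edge, rather than re-running the full analysis. This is an assumed simplification (single-rail arrival times over an acyclic event graph, hypothetical function name), not the dissertation's actual update algorithm.

```python
from collections import deque

def update_arrivals(arrive, succ, pred, seeds):
    """Re-relax earliest firing times after a local edge-delay change.
    arrive: current per-event firing times; succ/pred: adjacency maps
    of (neighbor, delay) pairs; seeds: events whose incoming edges
    changed. Only the affected downstream region is visited, so the
    work is proportional to the change, not to the whole design."""
    work = deque(seeds)
    while work:
        n = work.popleft()
        # Recompute n's firing time from its (possibly changed) inputs.
        new = max((arrive[u] + d for u, d in pred.get(n, [])), default=0)
        if new != arrive[n]:
            arrive[n] = new
            # Propagate only when something actually moved.
            for m, _ in succ.get(n, []):
                work.append(m)
    return arrive
```

For example, after an optimization stretches one edge's delay, seeding the worklist with that edge's destination event updates the cycle time while leaving untouched regions of the graph unvisited.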