Operation chaining asynchronous pipelined circuits

Authors:
Girish Venkataramani; Seth C. Goldstein
Affiliations:
Carnegie Mellon University, Pittsburgh, PA; Carnegie Mellon University, Pittsburgh, PA
Venue:
Proceedings of the 2007 IEEE/ACM international conference on Computer-aided design
Year:
2007

References (15):
Self-timed rings and their application to division
Performance analysis based on timing simulation. DAC '94 Proceedings of the 31st annual Design Automation Conference
Four-phase micropipeline latch control circuits. IEEE Transactions on Very Large Scale Integration (VLSI) Systems
An optimal clock period selection method based on slack minimization criteria. ACM Transactions on Design Automation of Electronic Systems (TODAES)
Minimum area retiming with equivalent initial states. ICCAD '97 Proceedings of the 1997 IEEE/ACM international conference on Computer-aided design
MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Advanced compiler design and implementation
On the optimization power of retiming and resynthesis transformations. Proceedings of the 1998 IEEE/ACM international conference on Computer-aided design
Resynthesis and peephole transformations for the optimization of large-scale asynchronous systems. Proceedings of the 39th annual Design Automation Conference
Pipeline optimization for asynchronous circuits: complexity analysis and an efficient optimal algorithm. Proceedings of the 2000 IEEE/ACM international conference on Computer-aided design
Bounding Average Time Separations of Events in Stochastic Timed Petri Nets with Choice. ASYNC '99 Proceedings of the 5th International Symposium on Advanced Research in Asynchronous Circuits and Systems
Spatial computation. ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
System-level scheduling on instruction cell based reconfigurable systems. Proceedings of the conference on Design, automation and test in Europe: Proceedings
Leveraging protocol knowledge in slack matching. Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design
Global critical path: a tool for system-level timing analysis. Proceedings of the 44th annual Design Automation Conference

Cited by (2):
Slack analysis in the system design loop. CODES+ISSS '08 Proceedings of the 6th IEEE/ACM/IFIP international conference on Hardware/Software codesign and system synthesis
Performance-driven clustering of asynchronous circuits. PATMOS'11 Proceedings of the 21st international conference on Integrated circuit and system design: power and timing modeling, optimization, and simulation

Abstract

We define operation chaining (op-chaining) as an optimization problem: determining the pipeline depth that best balances performance against energy demands in pipelined asynchronous designs. Because there are no clock-period requirements, asynchronous pipeline stages can have non-uniform latencies. We exploit this fact to coalesce several stages into one, saving power and area by eliminating control-path resources from the pipeline. The trade-off is potentially reduced pipeline parallelism. In this paper, we formally define this optimization as a graph covering problem that finds sub-graphs, each of which will be synthesized as a single op-chained pipeline stage. We then define the solution space for provably correct solutions and present an algorithm to efficiently search this space. The search technique partitions the graph based on post-dominator relationships to find sub-graphs that are potential op-chain candidates. We use knowledge of the Global Critical Path (GCP) [13] to evaluate the performance impact of accepting a candidate sub-graph and formulate a heuristic cost function to model this trade-off. The algorithm has quadratic time complexity in the size of the dataflow graph. We have implemented this algorithm within an automated asynchronous synthesis toolchain [12]. Experimental evidence from applying the algorithm to several media processing kernels reveals that the average energy-delay and energy-delay-area products improve by about 1.4x and 1.8x respectively, with maximum improvements of 5x and 18x.
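
The abstract does not spell out how the post-dominator partitioning works, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It computes post-dominator sets on a small dataflow graph with the textbook iterative method and then collects, for a chosen node, every node it post-dominates. The five-node graph, the function names, and the rule "a candidate region is the set of nodes post-dominated by one node" are all assumptions made for illustration; the GCP-based cost function is not modeled.

```python
from typing import Dict, List, Set


def post_dominators(succ: Dict[str, List[str]], exit_node: str) -> Dict[str, Set[str]]:
    """Iterate pdom(n) = {n} | intersection of pdom(s) over successors s to a fixed point.

    Assumes every node can reach exit_node (illustrative assumption).
    """
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}  # start from the full set and shrink
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = set(nodes)
            for s in succ[n]:
                new &= pdom[s]
            new |= {n}
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def immediate_post_dominators(pdom: Dict[str, Set[str]], exit_node: str) -> Dict[str, str]:
    """Strict post-dominators of a node form a chain; the nearest has the largest pdom set."""
    ipdom = {}
    for n, doms in pdom.items():
        if n != exit_node:
            ipdom[n] = max(doms - {n}, key=lambda d: len(pdom[d]))
    return ipdom


def region_post_dominated_by(pdom: Dict[str, Set[str]], d: str) -> Set[str]:
    """Hypothetical candidate sub-graph: all nodes that d post-dominates (d included)."""
    return {n for n, doms in pdom.items() if d in doms}


# Hypothetical dataflow graph: a fans out to b and c, which rejoin at d, then e.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
pdom = post_dominators(succ, "e")
ipdom = immediate_post_dominators(pdom, "e")
region = region_post_dominated_by(pdom, "d")
print(sorted(region))  # ['a', 'b', 'c', 'd']
```

In this toy graph, node d post-dominates the whole fork-join diamond, so the diamond would be one candidate to evaluate; whether to accept it would depend on a cost function such as the one the abstract describes.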