Fast optical and process proximity correction algorithms for integrated circuit manufacturing
Proceedings of the 2007 IEEE/ACM international conference on Computer-aided design
Lithographic aerial image simulation with FPGA-based hardware acceleration
Proceedings of the 16th international ACM/SIGDA symposium on Field programmable gate arrays
Byte and modulo addressable parallel memory architecture for video coding
IEEE Transactions on Circuits and Systems for Video Technology
LegUp: high-level synthesis for FPGA-based processor/accelerator systems
Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays
Accelerating aerial image simulation with GPU
Proceedings of the International Conference on Computer-Aided Design
Architecture support for accelerator-rich CMPs
Proceedings of the 49th Annual Design Automation Conference
LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems
ACM Transactions on Embedded Computing Systems (TECS) - Special issue on application-specific processors
From software to accelerators with LegUp high-level synthesis
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
Efficient aerial image simulation on multi-core SIMD CPU
Proceedings of the International Conference on Computer-Aided Design
From design to design automation
Proceedings of the 2014 on International symposium on physical design
Lithography simulation, an essential step in design for manufacturability (DFM), is still far from computationally efficient: most leading companies rely on large clusters of server computers to achieve acceptable turnaround time. Coprocessor acceleration is therefore very attractive for increasing computational performance while reducing power consumption. This article describes the implementation of a customized FPGA accelerator using a polygon-based simulation model. An application-specific memory partitioning scheme is designed to meet the bandwidth requirements of a large number of processing elements. Deep loop pipelining and ping-pong-buffer-based function-block pipelining are also implemented in our design. Initial results show a 15X speedup over the software implementation running on a microprocessor, and further performance tuning is expected to yield additional gains. The implementation also leverages state-of-the-art C-to-RTL synthesis tools; at the same time, we identify the need for manual architecture-level exploration for parallel implementations. Moreover, we implement the algorithm on NVIDIA GPUs using the CUDA programming environment and provide useful comparisons between the different kinds of accelerators.