Lithography simulation, an essential step in design for manufacturability (DFM), is still far from computationally efficient. Most leading companies use large clusters of server computers to achieve acceptable turn-around time. Co-processor acceleration is therefore very attractive for obtaining increased computational performance with reduced power consumption. This paper describes an implementation of a customized accelerator on an FPGA using a polygon-based simulation model. An application-specific memory partitioning scheme is designed to meet the bandwidth requirements of a large number of processing elements. Deep loop pipelining and ping-pong-buffer-based function block pipelining are also implemented in our design. Initial results show a 15X speedup over the software implementation running on a microprocessor, and further speedup is expected from additional performance tuning. The implementation leverages state-of-the-art C-to-RTL synthesis tools. At the same time, we identified the need for manual architecture-level exploration for parallel implementations.
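The abstract does not describe the application-specific memory partitioning scheme in detail. As a rough illustration of why partitioning helps feed many processing elements, the sketch below models plain cyclic partitioning in software. It is a generic textbook technique, not the authors' scheme, and the names `bank_of`, `offset_of`, `partition`, and `conflict_free` are hypothetical helpers introduced only for this example.

```python
# Software model of cyclic memory partitioning (an assumed, generic scheme;
# the paper's application-specific partitioning is not specified here).

def bank_of(address: int, num_banks: int) -> int:
    """Bank that holds a given address under cyclic partitioning."""
    return address % num_banks


def offset_of(address: int, num_banks: int) -> int:
    """Position of the address inside its bank."""
    return address // num_banks


def partition(data: list, num_banks: int) -> list:
    """Distribute a flat array over num_banks separate banks."""
    banks = [[] for _ in range(num_banks)]
    for address, value in enumerate(data):
        banks[bank_of(address, num_banks)].append(value)
    return banks


def conflict_free(addresses: list, num_banks: int) -> bool:
    """True if all addresses fall in distinct banks, so that single-ported
    banks could serve them in the same cycle."""
    banks_hit = [bank_of(a, num_banks) for a in addresses]
    return len(set(banks_hit)) == len(banks_hit)


data = list(range(16))
banks = partition(data, 4)

# Four processing elements reading consecutive addresses hit four different banks.
print(conflict_free([4, 5, 6, 7], 4))
# A stride equal to the bank count sends every access to the same bank.
print(conflict_free([0, 4, 8, 12], 4))
```

In this model, consecutive accesses from parallel processing elements spread across banks, while a stride equal to the bank count serializes them. That is one reason a partitioning scheme may need to be tailored to the access pattern of the application.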