A loosely coupled parallel algorithm for standard cell placement
ICCAD '94 Proceedings of the 1994 IEEE/ACM international conference on Computer-aided design
Fast, contention-free combining tree barriers for shared-memory multiprocessors
International Journal of Parallel Programming
Parallel algorithms for FPGA placement
GLSVLSI '00 Proceedings of the 10th Great Lakes symposium on VLSI
Hardware-assisted simulated annealing with application for fast FPGA placement
FPGA '03 Proceedings of the 2003 ACM/SIGDA eleventh international symposium on Field programmable gate arrays
Parallel Simulated Annealing Algorithms for Cell Placement on Hypercube Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
VPR: A new packing, placement and routing tool for FPGA research
FPL '97 Proceedings of the 7th International Workshop on Field-Programmable Logic and Applications
Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design
High-quality, deterministic parallel placement for FPGAs on commodity hardware
Proceedings of the 16th international ACM/SIGDA symposium on Field programmable gate arrays
Area and delay trade-offs in the circuit and architecture design of FPGAs
Proceedings of the 16th international ACM/SIGDA symposium on Field programmable gate arrays
Enhancing timing-driven FPGA placement for pipelined netlists
Proceedings of the 45th annual Design Automation Conference
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Towards scalable placement for FPGAs
Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays
Parallelizing Simulated Annealing-Based Placement Using GPGPU
FPL '10 Proceedings of the 2010 International Conference on Field Programmable Logic and Applications
Hi-index | 0.00 |
This paper describes a parallel implementation of the timing-driven VPR~5.0 simulated annealing engine. By restricting the move distance to a confined neighborhood, it is possible to consider a large number of non-conflicting moves in parallel and achieve a deterministic result. The full timing-driven algorithm is parallelized, including the detailed timing analysis updates done periodically while placement progresses. The limited move slightly degrades the placement quality, but this is necessary to expose greater degrees of parallelism. The overall bounding box metric degrades about 11% and critical path delay metric degrades about 8% compared to VPR's original algorithm, but we show the amount of degradation is independent of the number of threads. Overall, the parallel implementation scales to a speedup of 123x using 25 threads compared to VPR. With additional tuning effort, we believe the algorithm can be scaled to a larger number of threads, perhaps even run on a GPU, with little additional quality degradation.