As soft processors are used in an increasingly diverse range of applications, their microarchitectures need to evolve to suit the FPGA implementation substrate. This paper compares the delay and area of a comprehensive set of processor building-block circuits when implemented on custom CMOS and on FPGA substrates. We then use the results of these comparisons to infer how the microarchitecture of soft processors on FPGAs should differ from that of hard processors on custom CMOS. We find that the FPGA-to-custom-CMOS area ratios vary significantly more across building blocks than the delay ratios do. Because area is often a key design constraint in FPGA circuits, area ratios have the greatest impact on microarchitecture choices. Complete processor cores have area ratios of 17-27x and delay ratios of 18-26x. Building blocks with dedicated hardware support on FPGAs, such as SRAMs, adders, and multipliers, are particularly area-efficient (2-7x area ratio), while multiplexers and CAMs are particularly area-inefficient (100x area ratio). Compared with similar hard processors, this leads to cheaper ALUs, larger caches of low associativity, and more expensive bypass networks. We also find that the low delay ratio of pipeline latches (12-19x) suggests that soft processors should have pipeline depths 20% greater than hard processors of similar complexity.
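The argument from area ratios to microarchitecture choices can be made concrete with a little arithmetic. The sketch below is illustrative only: it takes the area ratios quoted above and divides each building block's ratio by the whole-core ratio, so a value below 1 marks a block that is relatively cheap on an FPGA and a value above 1 marks one that is relatively expensive. Using the midpoint of each quoted range is an assumption made here for illustration, not a method taken from the paper, and the names `AREA_RATIO`, `midpoint`, and `relative_area_cost` are hypothetical.

```python
# FPGA / custom-CMOS area ratios as quoted in the abstract, stored as (low, high).
# The 100x figure is quoted as a single value, so low == high.
AREA_RATIO = {
    "complete processor core": (17.0, 27.0),
    "SRAMs, adders, multipliers": (2.0, 7.0),
    "multiplexers and CAMs": (100.0, 100.0),
}


def midpoint(rng):
    """Midpoint of a (low, high) range. Assumption: the midpoint stands in for the range."""
    low, high = rng
    return (low + high) / 2.0


def relative_area_cost(block, baseline="complete processor core"):
    """Area ratio of `block` divided by the area ratio of `baseline` (both midpoints)."""
    return midpoint(AREA_RATIO[block]) / midpoint(AREA_RATIO[baseline])


for block in AREA_RATIO:
    print(f"{block}: {relative_area_cost(block):.2f} x the whole-core area penalty")
```

Under these assumptions the hardware-supported blocks come out well below 1 and the multiplexer and CAM entry well above it, which matches the abstract's conclusion that ALUs and caches are comparatively cheap on FPGAs while bypass networks and associative structures are comparatively expensive.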