High-quality conversion of scanned documents into PDF usually relies on either full OCR or token compression. This paper describes an approach intermediate between the two: it is based on token clustering, but additionally groups tokens into candidate fonts. The approach has the potential to yield OCR-like PDFs when the inputs are of high quality and to degrade to token-based compression when the font analysis fails, while preserving full visual fidelity. The grouping is performed by an unsupervised algorithm that constructs a graph based on token proximity and derives token groups by partitioning this graph. In initial experiments on scanned 300 dpi pages containing multiple fonts, this technique reconstructs candidate fonts with 100% accuracy.
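The graph construction and partitioning step can be illustrated with a minimal sketch. The abstract does not say how proximity is measured or how the graph is partitioned, so everything specific here is an assumption made for illustration: the function name `group_tokens`, the `(cluster_id, x, y)` token representation, the Euclidean distance threshold `max_dist`, and the use of connected components (via union-find) as the partitioning rule. It is not the authors' algorithm.

```python
from itertools import combinations
from math import hypot


def group_tokens(tokens, max_dist=50.0):
    """Group token clusters into candidate fonts (illustrative sketch only).

    tokens: list of (cluster_id, x, y) occurrences on a page (assumed format).
    max_dist: assumed proximity threshold in pixels.
    Returns a list of sets of cluster ids, one set per candidate font.
    """
    parent = {}

    def find(a):
        # Union-find lookup with path halving.
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Every token cluster is a node in the graph.
    for cid, _, _ in tokens:
        find(cid)

    # Add an edge between two clusters whenever occurrences of them lie
    # close together on the page; merging the endpoints of each edge
    # yields the connected components of the proximity graph.
    for (ca, xa, ya), (cb, xb, yb) in combinations(tokens, 2):
        if ca != cb and hypot(xa - xb, ya - yb) <= max_dist:
            union(ca, cb)

    # Collect the components (assumed stand-in for graph partitioning).
    groups = {}
    for cid in list(parent):
        groups.setdefault(find(cid), set()).add(cid)
    return sorted(groups.values(), key=lambda g: sorted(g))


tokens = [
    ("a", 0, 0), ("b", 10, 0), ("a", 20, 0),   # one text region
    ("c", 500, 500), ("d", 510, 500),          # a distant region
]
candidate_fonts = group_tokens(tokens, max_dist=50.0)
```

In this toy input, clusters `a` and `b` occur near each other and end up in one group, while `c` and `d` form a second group. A real system would likely need weighted edges and a more robust partitioning criterion than plain connected components, since a single spurious edge merges two fonts.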