In this paper we introduce a document understanding system designed specifically for the structural analysis of historical documents. The system was tested on 200 digitised books comprising about 60,000 pages from the 19th and 20th centuries. It can also be adapted to other document types, such as newspapers, journals, or typescripts. The system takes OCR-processed page images as input and outputs labels for structural elements such as page numbers, headings, headers, and footnotes. Its core algorithm is a hybrid rule-based approach using fuzzy logic, which combines the power of hand-coded rules based on domain knowledge with the flexibility of machine-learned rules. Additionally, a grammar-based approach is used for automated validation and refinement.
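To make the idea of fuzzy rules for structural labelling more concrete, the following is a minimal sketch, not the system's actual rule base. Everything in it is an assumption made for illustration: the `Line` features, the membership thresholds, the four rules, and the `body` fallback label. Each rule yields a degree of truth between 0 and 1, with fuzzy AND taken as `min` and fuzzy OR as `max`. In a hybrid setting, thresholds like these could be set by hand or learned from labelled pages.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Line:
    """Hypothetical features of one OCR text line (not taken from the paper)."""
    text: str
    y_rel: float       # vertical position on the page: 0.0 = top, 1.0 = bottom
    font_ratio: float  # font size divided by the page's body-text font size


def ramp_up(x: float, lo: float, hi: float) -> float:
    """Membership that rises linearly from 0 at lo to 1 at hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def ramp_down(x: float, lo: float, hi: float) -> float:
    """Membership that falls linearly from 1 at lo to 0 at hi."""
    return 1.0 - ramp_up(x, lo, hi)


def digit_share(text: str) -> float:
    """Fraction of non-space characters that are digits."""
    chars = [c for c in text if not c.isspace()]
    return sum(c.isdigit() for c in chars) / len(chars) if chars else 0.0


def score_labels(line: Line) -> Dict[str, float]:
    """Evaluate illustrative fuzzy rules; AND is min, OR is max, NOT is 1 - x."""
    # Fuzzy predicates over the raw features (all thresholds are assumed).
    near_top = ramp_down(line.y_rel, 0.05, 0.15)
    near_bottom = ramp_up(line.y_rel, 0.85, 0.95)
    large_font = ramp_up(line.font_ratio, 1.1, 1.5)
    small_font = ramp_down(line.font_ratio, 0.7, 0.95)
    numeric = ramp_up(digit_share(line.text), 0.5, 1.0)
    short = ramp_down(len(line.text), 5, 60)

    return {
        # (near top OR near bottom) AND mostly digits
        "page_number": min(max(near_top, near_bottom), numeric),
        # large font AND short AND NOT numeric
        "heading": min(large_font, short, 1.0 - numeric),
        # near top AND NOT numeric AND NOT large font
        "header": min(near_top, 1.0 - numeric, 1.0 - large_font),
        # near bottom AND small font AND NOT numeric
        "footnote": min(near_bottom, small_font, 1.0 - numeric),
    }


def classify(line: Line, threshold: float = 0.5) -> str:
    """Return the best-scoring label, or 'body' if no rule fires strongly."""
    scores = score_labels(line)
    label, degree = max(scores.items(), key=lambda kv: kv[1])
    return label if degree >= threshold else "body"


if __name__ == "__main__":
    for sample in (
        Line("217", 0.97, 1.0),
        Line("CHAPTER II", 0.30, 1.6),
        Line("1 See the preface to the second edition.", 0.92, 0.75),
        Line("It was a long journey across the moor.", 0.50, 1.0),
    ):
        print(sample.text, "->", classify(sample))
```

Because each label carries a degree rather than a hard yes or no, a later validation stage (such as the grammar-based refinement mentioned above) could reconsider low-confidence labels instead of treating every decision as final.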