Rule based document understanding of historical books using a hybrid fuzzy classification system

Authors:
Lukas Gander;Cornelia Lezuo;Raphael Unterweger
Affiliations:
Innsbruck University Library, Innrain, Innsbruck, Austria;Innsbruck University Library, Innrain, Innsbruck, Austria;Innsbruck University Library, Innrain, Innsbruck, Austria
Venue:
Proceedings of the 2011 Workshop on Historical Document Imaging and Processing
Year:
2011



Abstract

In this paper we introduce a document understanding system designed specifically for the structural analysis of historical documents. The system was tested on 200 digitised books comprising about 60,000 pages from the 19th and 20th centuries. It can also be adapted to other document types, such as newspapers, journals or typescripts. The system takes OCR-processed page images as input and outputs labels for structural elements such as page numbers, headings, headers and footnotes. Its core algorithm is a hybrid rule-based approach using fuzzy logic, which combines hand-coded rules that encode domain knowledge with the flexibility of machine-learned rules. In addition, a grammar-based approach is used for automated validation and refinement.
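
To make the idea of fuzzy rule-based labelling concrete, the following is a minimal sketch and not the authors' implementation. It assumes each OCR text block carries a few layout features (the names `rel_y` and `rel_font`, all thresholds, and the four rules are hypothetical), and it uses the standard fuzzy operators: minimum for AND, maximum for OR, and 1 - x for NOT. Only hand-coded rules are shown; the machine-learned rules and the grammar-based validation mentioned in the abstract are not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One OCR text block with a few hypothetical layout features."""
    text: str
    rel_y: float     # vertical centre on the page: 0.0 = top, 1.0 = bottom
    rel_font: float  # font size divided by the page's body-text font size


def ramp(x: float, lo: float, hi: float) -> float:
    """Fuzzy membership rising linearly from 0 at lo to 1 at hi."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


# Membership functions; every threshold below is an illustrative assumption.
def near_top(b: Block) -> float:
    return 1.0 - ramp(b.rel_y, 0.05, 0.20)


def near_bottom(b: Block) -> float:
    return ramp(b.rel_y, 0.80, 0.95)


def large_font(b: Block) -> float:
    return ramp(b.rel_font, 1.1, 1.5)


def small_font(b: Block) -> float:
    return 1.0 - ramp(b.rel_font, 0.70, 0.95)


def is_short(b: Block) -> float:
    return 1.0 - ramp(len(b.text), 10, 80)


def is_numeric(b: Block) -> float:
    """Fraction of non-space characters that are digits."""
    chars = [c for c in b.text if not c.isspace()]
    return sum(c.isdigit() for c in chars) / len(chars) if chars else 0.0


# Hand-coded fuzzy rules: min acts as AND, max as OR, (1 - x) as NOT.
RULES = {
    "page_number": lambda b: min(is_numeric(b), is_short(b),
                                 max(near_top(b), near_bottom(b))),
    "header": lambda b: min(near_top(b), is_short(b), 1.0 - is_numeric(b)),
    "heading": lambda b: min(large_font(b), is_short(b), 1.0 - is_numeric(b)),
    "footnote": lambda b: min(near_bottom(b), small_font(b)),
}


def classify(block: Block) -> tuple[str, dict[str, float]]:
    """Return the label with the highest rule activation and all scores."""
    scores = {label: rule(block) for label, rule in RULES.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    label, scores = classify(Block("17", rel_y=0.96, rel_font=1.0))
    print(label, scores)
```

Because each rule yields a degree of activation rather than a yes/no answer, a later stage could compare competing labels or reject low-scoring ones; how the original system resolves such conflicts is not stated in the abstract.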