Rule based document understanding of historical books using a hybrid fuzzy classification system

  • Authors:
  • Lukas Gander;Cornelia Lezuo;Raphael Unterweger

  • Affiliations:
  • Innsbruck University Library, Innrain, Innsbruck, Austria;Innsbruck University Library, Innrain, Innsbruck, Austria;Innsbruck University Library, Innrain, Innsbruck, Austria

  • Venue:
  • Proceedings of the 2011 Workshop on Historical Document Imaging and Processing
  • Year:
  • 2011

Abstract

In this paper we introduce a document understanding system designed specifically for the structural analysis of historical documents. The system was tested on 200 digitised books comprising about 60,000 pages from the 19th and 20th centuries. It can also be adapted to other document types, such as newspapers, journals, or typescripts. The system takes OCR-processed page images as input and outputs labels for structural elements such as page numbers, headings, headers, and footnotes. Its core algorithm is a hybrid rule-based approach using fuzzy logic, which combines the power of hand-coded rules drawing on domain knowledge with the flexibility of machine-learned rules. Additionally, a grammar-based approach is used for automated validation and refinement.
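
To make the idea in the abstract concrete, the following is a minimal sketch of fuzzy rule-based labelling followed by a grammar check. It is not the authors' implementation. The `Block` fields, the membership functions, every threshold, the rule set, and the page grammar are assumptions invented for illustration. Only hand-coded rules are shown; the machine-learned rules mentioned in the abstract are omitted.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical OCR text block; field names are assumptions."""
    text: str
    rel_y: float     # vertical position: 0.0 = top of page, 1.0 = bottom
    rel_font: float  # font size relative to body text (1.0 = same size)


def ramp(x, lo, hi):
    """Fuzzy membership rising linearly from 0 at lo to 1 at hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


# Fuzzy features; all breakpoints are illustrative guesses.
def near_top(b): return 1.0 - ramp(b.rel_y, 0.05, 0.15)
def near_bottom(b): return ramp(b.rel_y, 0.85, 0.95)
def large_font(b): return ramp(b.rel_font, 1.1, 1.5)
def small_font(b): return 1.0 - ramp(b.rel_font, 0.7, 0.95)
def is_short(b): return 1.0 - ramp(len(b.text), 20, 80)


def is_numeric(b):
    # Crude test for arabic or lowercase roman page numbers.
    pattern = r"\d{1,4}|[ivxlc]+"
    return 1.0 if re.fullmatch(pattern, b.text.strip().lower()) else 0.0


# Hand-coded rules: fuzzy AND is min, fuzzy OR is max, NOT is 1 - x.
RULES = {
    "page_number": lambda b: min(is_numeric(b),
                                 max(near_top(b), near_bottom(b))),
    "header": lambda b: min(near_top(b), is_short(b), 1.0 - is_numeric(b)),
    "footnote": lambda b: min(near_bottom(b), small_font(b),
                              1.0 - is_numeric(b)),
    "heading": lambda b: min(large_font(b), is_short(b), 1.0 - near_top(b)),
}


def classify(block, threshold=0.5):
    """Return the best-scoring label, or 'body' if no rule fires strongly."""
    scores = {label: rule(block) for label, rule in RULES.items()}
    label, score = max(scores.items(), key=lambda kv: kv[1])
    return label if score >= threshold else "body"


# Assumed page grammar over one-letter label codes, in reading order:
# optional header, optional page number, headings/body, footnotes,
# optional page number.
CODES = {"header": "R", "page_number": "P", "heading": "H",
         "body": "B", "footnote": "F"}
PAGE_GRAMMAR = re.compile(r"R?P?[HB]*F*P?")


def is_valid_page(labels):
    """Check a page's label sequence against the assumed grammar."""
    code = "".join(CODES[label] for label in labels)
    return PAGE_GRAMMAR.fullmatch(code) is not None
```

In a sketch like this, a page whose label sequence fails `is_valid_page` would be a candidate for re-labelling, for example by falling back to each block's second-best score. How the actual system performs refinement is not described in the abstract.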