Machine Learning for Intelligent Processing of Printed Documents

  • Authors:
  • Floriana Esposito; Donato Malerba; Francesca A. Lisi

  • Affiliations:
  • Dipartimento di Informatica, Università degli Studi di Bari, via Orabona 4, 70125 Bari, Italy. esposito@di.uniba.it; Dipartimento di Informatica, Università degli Studi di Bari, via Orabona 4, 70125 Bari, Italy. malerba@di.uniba.it; Dipartimento di Informatica, Università degli Studi di Bari, via Orabona 4, 70125 Bari, Italy. lisi@di.uniba.it

  • Venue:
  • Journal of Intelligent Information Systems - Special issue on methodologies for intelligent information systems
  • Year:
  • 2000

Abstract

A paper document processing system is an information system component which transforms information on printed or handwritten documents into a computer-revisable form. In intelligent systems for paper document processing this information capture process is based on knowledge of the specific layout and logical structures of the documents. This article proposes the application of machine learning techniques to acquire the specific knowledge required by an intelligent document processing system, named WISDOM++, that manages printed documents, such as letters and journals. Knowledge is represented by means of decision trees and first-order rules automatically generated from a set of training documents. In particular, an incremental decision tree learning system is applied for the acquisition of decision trees used for the classification of segmented blocks, while a first-order learning system is applied for the induction of rules used for the layout-based classification and understanding of documents. Issues concerning the incremental induction of decision trees and the handling of both numeric and symbolic data in first-order rule learning are discussed, and the validity of the proposed solutions is empirically evaluated by processing a set of real printed documents.
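
As a rough illustration of the block-classification step mentioned in the abstract, the sketch below learns a small decision tree over numeric block features and uses it to label new blocks. It is not the WISDOM++ implementation: it is a plain batch learner using Gini impurity rather than the incremental learner the article describes, and the feature names (height, width, black-pixel ratio), the class labels, and all sample values are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature_index, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Batch induction: a leaf is a label, an internal node is a 4-tuple."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    )

def classify(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical training blocks: (height, width, black_pixel_ratio) -> label.
TRAIN = [
    ((12, 300, 0.30), "text"),
    ((14, 280, 0.35), "text"),
    ((10, 320, 0.28), "text"),
    ((2, 400, 0.90), "horizontal_line"),
    ((3, 380, 0.95), "horizontal_line"),
    ((150, 200, 0.55), "picture"),
    ((180, 220, 0.60), "picture"),
]

rows = [features for features, _ in TRAIN]
labels = [label for _, label in TRAIN]
tree = build_tree(rows, labels)

print(classify(tree, (11, 310, 0.32)))
```

An incremental learner of the kind the abstract refers to would revise the tree as new training blocks arrive instead of rebuilding it from the full training set, which this sketch does not attempt.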