Structural extraction from visual layout of documents

  • Authors:
  • Binyamin Rosenfeld;Ronen Feldman;Yonatan Aumann

  • Affiliations:
  • ClearForest Corporation, New York, NY;ClearForest Corporation, New York, NY and Bar Ilan University, Ramat Gan, Israel;ClearForest Corporation, New York, NY and Bar Ilan University, Ramat Gan, Israel

  • Venue:
  • Proceedings of the eleventh international conference on Information and knowledge management
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

Most information extraction systems focus on the textual content of the documents. They treat documents as sequences or of words, disregarding the physical and typographical layout of the information.. While this strategy helps in focusing the extraction process on the key semantic content of the document, much valuable information can also be derived form the document physical appearance. Often, fonts, physical positioning and other graphical characteristics are used to provide additional context to the information. This information is lost with pure-text analysis. In this paper we describe a general procedure for structural extraction, which allows for automatic extraction of entities from the document based on their visual characteristics and relative position in the document layout. Our structural extraction procedure is a learning algorithm, which knows how to automatically generalizes from examples. The procedure is a general one, applicable to any document format with visual and typographical information. We also then describe a specific implementation of the procedure to PDF documents, called PES (PDF Extraction System). PES works with PDF documents and is able to extract such fields such as Author(s), Title, Date, etc. with very high accuracy.