Structural extraction from visual layout of documents

Authors:
Binyamin Rosenfeld;Ronen Feldman;Yonatan Aumann
Affiliations:
ClearForest Corporation, New York, NY;ClearForest Corporation, New York, NY and Bar Ilan University, Ramat Gan, Israel;ClearForest Corporation, New York, NY and Bar Ilan University, Ramat Gan, Israel
Venue:
Proceedings of the eleventh international conference on Information and knowledge management
Year:
2002

Abstract

Most information extraction systems focus on the textual content of documents. They treat documents as sequences of words, disregarding the physical and typographical layout of the information. While this strategy helps focus the extraction process on the key semantic content of the document, much valuable information can also be derived from the document's physical appearance. Often, fonts, physical positioning, and other graphical characteristics are used to provide additional context to the information. This information is lost with pure-text analysis. In this paper we describe a general procedure for structural extraction, which allows for automatic extraction of entities from the document based on their visual characteristics and relative position in the document layout. Our structural extraction procedure is a learning algorithm that automatically generalizes from examples. The procedure is a general one, applicable to any document format with visual and typographical information. We then describe a specific implementation of the procedure for PDF documents, called PES (PDF Extraction System). PES is able to extract fields such as Author(s), Title, and Date with very high accuracy.
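
To make the idea of learning from visual characteristics concrete, the following is a minimal sketch only. It is not the PES algorithm, which the abstract does not specify. It assumes each page has already been reduced to text blocks with a font size and a normalized position, and it swaps in a simple nearest-centroid learner as a stand-in for the paper's procedure. The names TextBlock, features, learn_template, and extract are hypothetical.

```python
from dataclasses import dataclass
from math import dist
from statistics import mean


@dataclass(frozen=True)
class TextBlock:
    """One layout primitive (hypothetical representation, not from the paper)."""
    text: str
    font_size: float  # in points
    x: float          # left edge as a fraction of page width (0..1)
    y: float          # top edge as a fraction of page height (0 = top of page)


def features(block, page):
    """Visual features of a block relative to its page.

    Font size is divided by the largest font on the page so that
    documents set in different absolute sizes remain comparable.
    """
    max_size = max(b.font_size for b in page)
    return (block.font_size / max_size, block.x, block.y)


def learn_template(examples):
    """Generalize from labelled pages.

    examples: list of (page, labels), where page is a list of TextBlock
    and labels maps a field name to the index of the labelled block.
    Returns a dict mapping each field to the mean feature vector
    (centroid) of its labelled blocks.
    """
    by_field = {}
    for page, labels in examples:
        for field, idx in labels.items():
            by_field.setdefault(field, []).append(features(page[idx], page))
    return {
        field: tuple(mean(column) for column in zip(*vectors))
        for field, vectors in by_field.items()
    }


def extract(page, template):
    """For each field, return the text of the block nearest to its centroid."""
    return {
        field: min(page, key=lambda b: dist(features(b, page), centroid)).text
        for field, centroid in template.items()
    }
```

Given a handful of labelled first pages, learn_template yields one centroid per field, and extract applies those centroids to an unseen page. Text content is never inspected, so the sketch relies purely on layout, which is the kind of information the abstract says is lost with pure-text analysis.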