Most information extraction systems focus on the textual content of documents. They treat documents as sequences of words, disregarding the physical and typographical layout of the information. While this strategy helps focus the extraction process on the key semantic content of the document, much valuable information can also be derived from the document's physical appearance. Fonts, physical positioning, and other graphical characteristics are often used to provide additional context to the information, and this context is lost in pure-text analysis. In this paper we describe a general procedure for structural extraction, which allows for automatic extraction of entities from a document based on their visual characteristics and relative position in the document layout. Our structural extraction procedure is a learning algorithm that automatically generalizes from examples. The procedure is a general one, applicable to any document format with visual and typographical information. We then describe a specific implementation of the procedure for PDF documents, called PES (PDF Extraction System). PES is able to extract fields such as Author(s), Title, and Date with very high accuracy.
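To make the idea of learning from visual characteristics concrete, the following is a minimal sketch and not the PES algorithm itself, which the abstract does not specify. It assumes each layout block carries a font size, a bold flag, and a vertical position. The `Block` type, the feature choice, the nearest-prototype rule, and the `scales` weights are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Block:
    """Hypothetical layout block: text plus a few visual attributes."""
    text: str
    font_size: float  # in points
    bold: bool
    y: float          # vertical position as a fraction of page height (0.0 = top)


def features(block):
    # Visual features only; the text itself is deliberately ignored.
    return (block.font_size, 1.0 if block.bold else 0.0, block.y)


def learn_prototype(examples):
    """Generalize from labeled example blocks by averaging their features."""
    vectors = [features(b) for b in examples]
    return tuple(mean(column) for column in zip(*vectors))


def extract(blocks, prototype, scales=(10.0, 1.0, 1.0)):
    """Return the block whose scaled visual features are closest to the prototype."""
    def distance(block):
        return sum(((f - p) / s) ** 2
                   for f, p, s in zip(features(block), prototype, scales))
    return min(blocks, key=distance)


# Assumed training data: title blocks labeled on two example documents.
title_examples = [
    Block("A Study of Something", 18.0, True, 0.10),
    Block("Another Paper Title", 20.0, True, 0.12),
]
prototype = learn_prototype(title_examples)

# A new, unlabeled page.
page = [
    Block("Visual Layout Extraction", 17.0, True, 0.09),
    Block("J. Doe", 11.0, False, 0.16),
    Block("Body text of the first paragraph.", 10.0, False, 0.50),
]
title = extract(page, prototype)
print(title.text)
```

The sketch selects the title purely from typography and position, which is the kind of signal that pure-text analysis discards. A practical system would need richer features and relative-position relations between blocks.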