Object-level document analysis of PDF files

Authors:
Tamir Hassan
Affiliations:
Technische Universität Wien, Wien, Austria
Venue:
Proceedings of the 9th ACM symposium on Document engineering
Year:
2009

Citing 5

AIDAS: Incremental Logical Structure Discovery in PDF Documents
ICDAR '01 Proceedings of the Sixth International Conference on Document Analysis and Recognition

Extraction, layout analysis and classification of diagrams in PDF documents
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2

Xed: A New Tool for eXtracting Hidden Structures from Electronic Documents
DIAL '04 Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04)

Table Recognition and Understanding from PDF Files
ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02

User-Guided Wrapping of PDF Documents Using Graph Matching Techniques
ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition

Cited 2

Book4All: A Tool to Make an e-Book More Accessible to Students with Vision/Visual-Impairments
USAB '09 Proceedings of the 5th Symposium of the Workgroup Human-Computer Interaction and Usability Engineering of the Austrian Computer Society on HCI and Usability for e-Inclusion

Document understanding of graphical content in natively digital PDF documents
Proceedings of the 2012 ACM symposium on Document engineering

Abstract

The PDF format is commonly used for the exchange of documents on the Web and there is a growing need to understand and extract or repurpose data held in PDF documents. Many systems for processing PDF files use algorithms designed for scanned documents, which analyse a page based on its bitmap representation. We believe this approach to be inefficient. Not only does the rasterization step cost processing time, but information is also lost and errors can be introduced. Inspired primarily by the need to facilitate machine extraction of data from PDF documents, we have developed methods to extract textual and graphic content directly from the PDF content stream and represent it as a list of "objects" at a level of granularity suitable for structural understanding of the document. These objects are then grouped into lines, paragraphs and higher-level logical structures using a novel bottom-up segmentation algorithm based on visual perception principles. Experimental results demonstrate the viability of our approach, which is currently used as a basis for HTML conversion and data extraction methods.
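
To make the idea of bottom-up grouping concrete, here is a minimal sketch in Python. It is not the algorithm from the paper, whose details the abstract does not give. It is a simple greedy heuristic that merges text fragments into lines when they overlap vertically and sit close together horizontally, loosely following the proximity principle of visual perception. The `TextObject` class, the `group_into_lines` function and both thresholds are hypothetical names and values chosen for illustration. The fragments are assumed to have already been extracted from the content stream with bounding boxes in PDF coordinates (y grows upward).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextObject:
    """Hypothetical text fragment with a bounding box (y grows upward)."""
    text: str
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


def vertical_overlap(a: TextObject, b: TextObject) -> float:
    """Fraction of the shorter fragment's height that both boxes share."""
    shared = min(a.y1, b.y1) - max(a.y0, b.y0)
    shortest = min(a.y1 - a.y0, b.y1 - b.y0)
    return max(shared, 0.0) / shortest if shortest > 0 else 0.0


def group_into_lines(
    objects: List[TextObject],
    min_overlap: float = 0.5,     # assumed threshold, not from the paper
    max_gap_factor: float = 1.5,  # assumed: max gap as a multiple of height
) -> List[List[TextObject]]:
    """Greedily merge fragments into lines, working left to right."""
    lines: List[List[TextObject]] = []
    for obj in sorted(objects, key=lambda o: o.x0):
        max_gap = max_gap_factor * (obj.y1 - obj.y0)
        for line in lines:
            last = line[-1]  # rightmost fragment placed on this line so far
            close = obj.x0 - last.x1 <= max_gap
            if close and vertical_overlap(last, obj) >= min_overlap:
                line.append(obj)
                break
        else:
            # No existing line is near enough: start a new one.
            lines.append([obj])
    # Order lines top to bottom, then left to right.
    lines.sort(key=lambda line: (-line[0].y1, line[0].x0))
    return lines


fragments = [
    TextObject("Hello", 0, 100, 30, 110),
    TextObject("world", 33, 100, 60, 110),
    TextObject("Second", 0, 85, 40, 95),
    TextObject("line", 43, 85, 60, 95),
    TextObject("Far", 200, 100, 220, 110),  # same height, far to the right
]
for line in group_into_lines(fragments):
    print(" ".join(obj.text for obj in line))
```

The fragment "Far" shares a baseline with "Hello world" but is kept separate because the horizontal gap exceeds the threshold, which is how a heuristic of this kind avoids merging text across columns. Grouping lines into paragraphs could follow the same pattern with the roles of the two axes swapped.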