Towards a Canonical and Structured Representation of PDF Documents through Reverse Engineering

Authors:
Maurizio Rigamonti;Jean-Luc Bloechle;Karim Hadjar;Denis Lalanne;Rolf Ingold
Affiliations:
DIVA group, University of Fribourg, Switzerland;DIVA group, University of Fribourg, Switzerland;DIVA group, University of Fribourg, Switzerland;DIVA group, University of Fribourg, Switzerland;DIVA group, University of Fribourg, Switzerland
Venue:
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Year:
2005

Citing 3

AIDAS: Incremental Logical Structure Discovery in PDF Documents
ICDAR '01 Proceedings of the Sixth International Conference on Document Analysis and Recognition

Creating reusable well-structured PDF as a sequence of component object graphic (COG) elements
Proceedings of the 2003 ACM symposium on Document engineering

Xed: A New Tool for eXtracting Hidden Structures from Electronic Documents
DIAL '04 Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04)

Cited 6

FaericWorld: browsing multimedia events through static documents and links
INTERACT'07 Proceedings of the 11th IFIP TC 13 international conference on Human-computer interaction

Improving XED for extracting content from Arabic PDFs
DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems

Structure extraction from PDF-based book documents
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries

Browsing multimedia archives through intra- and multimodal cross-documents links
MLMI'05 Proceedings of the Second international conference on Machine Learning for Multimodal Interaction

Reengineering PDF-based documents targeting complex software specifications
International Journal of Knowledge and Web Intelligence

XCDF: a canonical and structured document format
DAS'06 Proceedings of the 7th international conference on Document Analysis Systems

Abstract

This article presents Xed, a reverse engineering tool for PDF documents that extracts the original document layout structure. Xed combines electronic extraction methods with state-of-the-art document analysis techniques and outputs the layout structure in a hierarchical canonical form, i.e. a form that is universal and independent of the document type. The article first reviews the major traps and tricks of the PDF format. It then introduces the architecture of Xed and its main modules, in particular the algorithm that extracts the document's physical structure. A canonical format is then proposed and discussed with an example. Finally, the results of a practical evaluation are presented, followed by an outline of future work on logical structure extraction.