Xed: A New Tool for eXtracting Hidden Structures from Electronic Documents
DIAL '04 Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04)
This article presents Xed, a reverse engineering tool for PDF documents that extracts the original document layout structure. Xed combines electronic extraction methods with state-of-the-art document analysis techniques and outputs the layout structure in a hierarchical canonical form, that is, a form that is universal and independent of the document type. The article first reviews the major traps and tricks of the PDF format. It then introduces the architecture of Xed and its main modules, in particular the algorithm that extracts the document's physical structure. Next, a canonical format is proposed and discussed with an example. Finally, the results of a practical evaluation are presented, followed by an outline of future work on logical structure extraction.
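To make the idea of a hierarchical, document-independent layout form more concrete, the following Python sketch groups positioned text tokens into lines and serializes the resulting tree as XML. It is only an illustration under stated assumptions and is not Xed's algorithm or its canonical format: the `Node` class, the level names (`page`, `line`, `token`), the `tol` threshold, and the `bbox` attribute are all hypothetical, and the grouping rule (tokens whose top edges nearly coincide belong to one line) is a deliberately simple stand-in for real physical structure extraction.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import xml.etree.ElementTree as ET

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed page coordinates


@dataclass
class Node:
    kind: str  # hypothetical level names: "page", "line", "token"
    bbox: BBox
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def union(boxes: List[BBox]) -> BBox:
    """Smallest box that encloses all given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def group_into_lines(tokens: List[Node], tol: float = 2.0) -> List[Node]:
    """Group tokens whose top edge lies within `tol` of the line's first token."""
    groups: List[List[Node]] = []
    for tok in sorted(tokens, key=lambda t: (t.bbox[1], t.bbox[0])):
        if groups and abs(groups[-1][0].bbox[1] - tok.bbox[1]) <= tol:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    lines = []
    for group in groups:
        group.sort(key=lambda t: t.bbox[0])  # left-to-right reading order assumed
        lines.append(Node("line", union([t.bbox for t in group]), children=group))
    return lines


def build_page(tokens: List[Node]) -> Node:
    """Build a page -> line -> token tree from a flat token list."""
    lines = group_into_lines(tokens)
    return Node("page", union([ln.bbox for ln in lines]), children=lines)


def to_xml(node: Node) -> ET.Element:
    """Serialize the tree; every level uses the same element shape."""
    elem = ET.Element(node.kind, bbox=",".join(f"{v:g}" for v in node.bbox))
    if node.text:
        elem.text = node.text
    for child in node.children:
        elem.append(to_xml(child))
    return elem


if __name__ == "__main__":
    tokens = [
        Node("token", (45, 101, 80, 111), "world"),
        Node("token", (10, 120, 35, 130), "Next"),
        Node("token", (10, 100, 40, 110), "Hello"),
    ]
    print(ET.tostring(to_xml(build_page(tokens)), encoding="unicode"))
```

Because every level shares one node shape, the same traversal and serialization code applies whatever the document type, which is the property the abstract attributes to a canonical form. A realistic extractor would also have to handle columns, rotated text, fonts, and graphics, none of which this sketch attempts.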