A system for converting PDF documents into structured XML format

Authors:
Hervé Déjean;Jean-Luc Meunier
Affiliations:
Xerox Research Centre Europe, Meylan;Xerox Research Centre Europe, Meylan
Venue:
DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
Year:
2006

Abstract

We present a system for converting legacy PDF documents into structured XML. The system first extracts the different streams contained in PDF files (text, bitmap images, and vector images) and then applies several components to express the logically structured document in XML. Some of these components are traditional in document analysis; others are more specific to PDF. We also present a graphical user interface for checking, correcting, and validating the analysis produced by the components. Finally, we report on two real user cases in which the system was applied.
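
The two-stage flow described in the abstract (extract raw content, then run components that add structure to an XML tree) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Token` record, the element names `DOCUMENT`, `PAGE`, `TOKEN`, and `LINE`, the `group_lines` component, and the convention that `y` grows downward. It does not parse PDF; it starts from already extracted text fragments.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical text fragment with its position on a page."""
    page: int
    x: float
    y: float
    text: str


def tokens_to_xml(tokens):
    """Serialize raw tokens into a flat, page-oriented XML tree."""
    root = ET.Element("DOCUMENT")
    pages = {}
    for tok in tokens:
        page = pages.get(tok.page)
        if page is None:
            page = ET.SubElement(root, "PAGE", number=str(tok.page))
            pages[tok.page] = page
        el = ET.SubElement(page, "TOKEN", x=str(tok.x), y=str(tok.y))
        el.text = tok.text
    return root


def group_lines(root, tolerance=2.0):
    """Example component: wrap tokens on roughly the same baseline in LINE."""
    for page in root.findall("PAGE"):
        tokens = sorted(page.findall("TOKEN"), key=lambda t: float(t.get("y")))
        for tok in tokens:
            page.remove(tok)
        # Start a new group whenever the vertical gap exceeds the tolerance.
        groups, last_y = [], None
        for tok in tokens:
            y = float(tok.get("y"))
            if last_y is None or abs(y - last_y) > tolerance:
                groups.append([])
            groups[-1].append(tok)
            last_y = y
        # Re-attach tokens under LINE elements, ordered left to right.
        for group in groups:
            line = ET.SubElement(page, "LINE")
            for tok in sorted(group, key=lambda t: float(t.get("x"))):
                line.append(tok)
    return root


def run_pipeline(tokens, components):
    """Build the initial tree, then apply each component in order."""
    root = tokens_to_xml(tokens)
    for component in components:
        root = component(root)
    return root
```

Calling `run_pipeline(tokens, [group_lines])` returns the root element, which `ET.tostring(root, encoding="unicode")` turns into an XML string. Treating each component as a function from tree to tree keeps the steps independent, so further components (for example, paragraph or heading detection) could be appended to the list without changing the others.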