Dolores: An Interactive and Class-Free Approach for Document Logical Restructuring

  • Authors:
  • Jean-Luc Bloechle; Catherine Pugin; Rolf Ingold

  • Venue:
  • DAS '08 Proceedings of the 2008 The Eighth IAPR International Workshop on Document Analysis Systems
  • Year:
  • 2008

Abstract

Recovering the physical and logical structure of electronic documents is still an open issue. In this paper, we propose a flexible and efficient approach for recovering document structures from PDF files. After a brief introduction to the PDF format and its major features, we report on our evaluation of existing tools and prior work on PDF content extraction and analysis. To overcome the weaknesses of these systems, we propose a new analysis strategy based on an intermediate representation, called XCDF, which represents physical structures in a canonical way. The paper then describes the PDF reverse-engineering workflow, focusing on the logical restructuring of documents. Finally, it concludes with potential future improvements.