OCD: An Optimized and Canonical Document Format

Authors:
Jean-Luc Bloechle;Denis Lalanne;Rolf Ingold
Venue:
ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition
Year:
2009

Cited by 1:
Improving XED for extracting content from Arabic PDFs. DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems

Abstract

Revealing and manipulating the structured content of PDF documents is a difficult task that requires pre-processing and reverse-engineering techniques. In this paper, we present OCD, an optimized, easy-to-process, and canonical format for representing structured electronic documents. We describe the system and methods used to reverse engineer PDF documents into the OCD format, as well as the techniques used to optimize it. Finally, we present concrete evaluations of the OCD format's compactness and restructuring performance.