Improving XED for extracting content from Arabic PDFs

Authors:
Karim Hadjar;Rolf Ingold
Affiliations:
Ahlia University, Manama, Kingdom of Bahrain;University of Fribourg, Fribourg, Switzerland
Venue:
DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
Year:
2010

Abstract

PDF documents are widely used, but extracting and manipulating their structured content is not an easy task. It requires sophisticated pre-processing and reverse engineering techniques. In this paper, we present an improvement of XED that handles unresolved issues related to the analysis of Arabic documents. A set of rules was proposed and implemented to enhance the extraction of Arabic content by taking care of the different Arabic fonts, mapping the un-interpreted Unicode values to the interpreted sets, and applying a reverse algorithm whenever needed. Finally, we present concrete evaluations of the improved XED.
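
The abstract does not spell out the rules themselves, so the following is only a minimal sketch of the two ideas it names, written under stated assumptions rather than as XED's actual implementation. It assumes that the "un-interpreted" values are Arabic presentation-form code points (positional glyph variants and ligatures), which Unicode compatibility normalization (NFKC) folds back to base letters, and that the "reverse algorithm" amounts to reversing text that was stored in visual, left-to-right glyph order. The function names `map_presentation_forms` and `to_logical_text` and the `visual_order` flag are hypothetical.

```python
import unicodedata


def map_presentation_forms(text: str) -> str:
    """Fold Arabic presentation forms to base letters.

    Assumption: NFKC normalization stands in for the paper's mapping rules.
    It replaces positional variants (initial, medial, final, isolated) and
    ligatures such as lam-alef with their base letters.
    """
    return unicodedata.normalize("NFKC", text)


def to_logical_text(extracted: str, visual_order: bool) -> str:
    """Return logically ordered Arabic text with base letters only.

    `visual_order` is a hypothetical flag set by the caller when the PDF
    stored the glyphs left to right as drawn; the paper's own criterion for
    when reversal is "needed" is not given in the abstract.
    """
    if visual_order:
        # Reverse before normalizing: a ligature expands to several letters,
        # and reversing afterwards would flip the letters inside it.
        extracted = extracted[::-1]
    return map_presentation_forms(extracted)


# Presentation forms in visual order:
# meem (isolated), lam-alef (final), seen (initial).
visual = "\uFEE1\uFEFC\uFEB3"
logical = to_logical_text(visual, visual_order=True)
```

In this sketch the order of the two steps matters: the lam-alef ligature is a single code point that expands to two letters, so reversing after normalization would swap those two letters.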