Xed: A New Tool for eXtracting Hidden Structures from Electronic Documents

Authors:
Karim Hadjar;Maurizio Rigamonti;Denis Lalanne;Rolf Ingold
Venue:
DIAL '04 Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04)
Year:
2004

Citing 0
Cited 20

Thematic segmentation of meetings through document/speech alignment

Proceedings of the 12th annual ACM international conference on Multimedia
Visual signature based identification of Low-resolution document images

Proceedings of the 2004 ACM symposium on Document engineering
Using bi-modal alignment and clustering techniques for documents and speech thematic segmentations

Proceedings of the thirteenth ACM international conference on Information and knowledge management
Enhancing composite digital documents using XML-based standoff markup

Proceedings of the 2005 ACM symposium on Document engineering
Data categorization for a context return applied to logical document structure recognition

ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Towards a Canonical and Structured Representation of PDF Documents through Reverse Engineering

ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
DocMIR: An automatic document-based indexing system for meeting retrieval

Multimedia Tools and Applications
Visual Analytics: Combining Automated Discovery with Interactive Visualizations

DS '08 Proceedings of the 11th International Conference on Discovery Science
Object-level document analysis of PDF files

Proceedings of the 9th ACM symposium on Document engineering
Improving XED for extracting content from Arabic PDFs

DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
Table of contents recognition for converting PDF documents in e-book formats

Proceedings of the 10th ACM symposium on Document engineering
Document resizing for visually impaired students

Proceedings of the 22nd Conference of the Computer-Human Interaction Special Interest Group of Australia on Computer-Human Interaction
Detection and resolution of references to meeting documents

MLMI'05 Proceedings of the Second international conference on Machine Learning for Multimodal Interaction
Recognition and classification of figures in PDF documents

GREC'05 Proceedings of the 6th international conference on Graphics Recognition: ten Years Review and Future Perspectives
Reengineering PDF-based documents targeting complex software specifications

International Journal of Knowledge and Web Intelligence
Using static documents as structured and thematic interfaces to multimedia meeting archives

MLMI'04 Proceedings of the First international conference on Machine Learning for Multimodal Interaction
Shallow dialogue processing using machine learning algorithms (or not)

MLMI'04 Proceedings of the First international conference on Machine Learning for Multimodal Interaction
A system for converting PDF documents into structured XML format

DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
XCDF: a canonical and structured document format

DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
Newspaper article reconstruction using ant colony optimization and bipartite graph

Applied Soft Computing


Abstract

PDF has become a very common format for exchanging printable documents. Moreover, it can easily be generated from the major document formats, which makes a huge number of PDF documents available over the net. However, its use is limited to displaying and printing, which considerably reduces search and retrieval capabilities. For this reason, additional tools have recently appeared that extract the textual content. However, their practical use is limited in the sense that the text's reading order is not necessarily preserved, especially when handling multi-column documents or in the presence of complex layouts. Our thesis is that those tools do not consider the hidden layout and logical structures of documents, which could greatly improve their results. We propose a novel approach to document content extraction that merges a) low-level extraction methods applied to PDF files with b) layout analysis performed on a synthetically generated TIFF image. The paper describes the various steps necessary to achieve this task. Finally, we present a first experiment on restoring the reading order of newspapers, which shows encouraging results.
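
The reading-order problem described in the abstract can be illustrated with a minimal sketch. It is not the Xed implementation. It only assumes that low-level extraction yields text fragments with bounding boxes and that layout analysis yields rectangular blocks such as columns. All names (Box, Fragment, block_aware_order) are hypothetical, coordinates are assumed to grow downward as in an image, and ordering blocks by their left edge is a simplification that would not hold for complex newspaper layouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; y is assumed to grow downward (image coordinates)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)

    def contains(self, x, y):
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass(frozen=True)
class Fragment:
    """Hypothetical text fragment, as a low-level PDF extractor might return it."""
    text: str
    box: Box


def naive_order(fragments):
    """Sort purely top-to-bottom, then left-to-right; interleaves columns."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.box.y0, f.box.x0))]


def block_aware_order(fragments, blocks):
    """Assign each fragment to the layout block containing its center,
    then read block by block (blocks ordered by left edge, then top edge)."""
    ordered_blocks = sorted(blocks, key=lambda b: (b.x0, b.y0))
    groups = [[] for _ in ordered_blocks]
    leftovers = []  # fragments that fall outside every block
    for frag in fragments:
        cx, cy = frag.box.center()
        for i, block in enumerate(ordered_blocks):
            if block.contains(cx, cy):
                groups[i].append(frag)
                break
        else:
            leftovers.append(frag)
    ordered = []
    for group in groups + [leftovers]:
        ordered.extend(sorted(group, key=lambda f: (f.box.y0, f.box.x0)))
    return [f.text for f in ordered]


if __name__ == "__main__":
    # Two columns, two lines each (made-up coordinates).
    frags = [
        Fragment("B1", Box(60, 0, 100, 10)),
        Fragment("A1", Box(0, 0, 40, 10)),
        Fragment("B2", Box(60, 20, 100, 30)),
        Fragment("A2", Box(0, 20, 40, 30)),
    ]
    columns = [Box(50, 0, 100, 100), Box(0, 0, 50, 100)]
    print(naive_order(frags))
    print(block_aware_order(frags, columns))
```

With the made-up two-column page in the demo, the naive sort interleaves lines from both columns, while the block-aware version reads the left column fully before the right one. Fragments that fall outside every block are appended at the end rather than silently dropped.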