Towards high-quality text stream extraction from PDF: technical background to the ACL 2012 contributed task

Authors:
Øyvind Raddum Berg;Stephan Oepen;Jonathon Read
Affiliations:
Universitetet i Oslo;Universitetet i Oslo;Universitetet i Oslo
Venue:
ACL '12 Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries
Year:
2012

Abstract

Extracting textual content and document structure from PDF presents a surprisingly (depressingly, to some, in fact) difficult challenge, owing to the purely display-oriented design of the PDF document standard. While a variety of lower-level PDF extraction toolkits exist, none fully supports the recovery of original text (in reading order) and relevant structural elements, even for so-called born-digital PDFs, i.e. those prepared electronically using typesetting systems like LaTeX, OpenOffice, and the like. This short paper summarizes a new tool for high-quality extraction of text and structure from PDFs, combining state-of-the-art PDF parsing, font interpretation, layout analysis, and TEI-compliant output of text and logical document markup.
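
To illustrate why a display-oriented format makes reading order hard to recover, the following minimal sketch models a page as positioned text fragments and compares two orderings. It is not the tool described in the abstract: the Fragment class, both functions, the coordinates, and the fixed column boundary are hypothetical, chosen only for illustration. Real layout analysis has to infer columns and blocks rather than receive a boundary as input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A positioned run of text, roughly as a display-oriented format stores it."""
    x: float  # left edge, in points (hypothetical units)
    y: float  # distance from the top of the page, in points
    text: str


def naive_order(fragments):
    """Sort top-to-bottom, then left-to-right, ignoring any column structure."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_aware_order(fragments, column_split):
    """Read everything left of column_split first, then everything right of it.

    Assumes exactly two columns and a known boundary, which is a strong
    simplification made for this example.
    """
    left = [f for f in fragments if f.x < column_split]
    right = [f for f in fragments if f.x >= column_split]
    return naive_order(left) + naive_order(right)


# Hypothetical two-column page; fragments appear in arbitrary drawing order.
# Intended reading order: A, B (left column), then C, D (right column).
page = [
    Fragment(310, 100, "C"),
    Fragment(50, 120, "B"),
    Fragment(50, 100, "A"),
    Fragment(310, 120, "D"),
]

print(naive_order(page))              # interleaves the two columns
print(column_aware_order(page, 300))  # keeps each column together
```

On this toy page the naive sort reads across both columns line by line, while the column-aware variant keeps each column intact. This is the kind of gap between display position and reading order that the abstract refers to.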