Combining OCR outputs for logical document structure markup: technical background to the ACL 2012 contributed task

Authors:
Ulrich Schäfer; Benjamin Weitz
Affiliations:
DFKI Language Technology, Saarbrücken, Germany; DFKI Language Technology, Saarbrücken, Germany
Venue:
ACL '12 Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries
Year:
2012

Abstract

We describe how paperXML, a logical document structure markup for scholarly articles, is generated from the outputs of OCR tools. PaperXML was initially developed for the ACL Anthology Searchbench. Its main purpose was to provide robust, uniform access to sentences in ACL Anthology papers from the past 46 years, ranging from scanned, typewriter-written conference and workshop proceedings papers to recent, high-quality typeset, born-digital journal articles with varying layouts. PaperXML markup includes information on page and paragraph breaks, section headings, footnotes, tables, captions, boldface and italic character styles, as well as bibliographic and publication metadata. The role of paperXML in the ACL Contributed Task Rediscovering 50 Years of Discoveries is to serve as a fall-back source (1) for older, scanned papers (mostly published before the year 2000) for which born-digital PDF sources are not available, (2) for born-digital PDF papers on which the PDFExtract method failed, and (3) for document parts for which PDFExtract does not currently output useful markup, such as tables. We sketch the transformation of paperXML into the ACL Contributed Task's TEI P5 XML.
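
The three fall-back cases listed in the abstract can be read as a simple per-paper, per-part selection rule. The following is only a minimal sketch of that reading, not the authors' implementation: the `Paper` record, its field names, the `choose_source` function, and the part label `"table"` are all hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical record; the field names are illustrative assumptions."""
    paper_id: str
    born_digital: bool    # False for scanned papers without a born-digital PDF
    pdfextract_ok: bool   # True if PDFExtract produced usable output


def choose_source(paper: Paper, part: str = "text") -> str:
    """Return the name of the pipeline that supplies markup for a document part."""
    if not paper.born_digital:
        return "paperXML"   # case (1): scanned paper, only OCR output exists
    if not paper.pdfextract_ok:
        return "paperXML"   # case (2): PDFExtract failed on this paper
    if part == "table":
        return "paperXML"   # case (3): part not usefully covered by PDFExtract
    return "PDFExtract"     # default: born-digital paper, PDFExtract succeeded


if __name__ == "__main__":
    scanned = Paper("scanned-example", born_digital=False, pdfextract_ok=False)
    modern = Paper("born-digital-example", born_digital=True, pdfextract_ok=True)
    print(choose_source(scanned))
    print(choose_source(modern))
    print(choose_source(modern, part="table"))
```

In this sketch the decision is made per document part, so a single born-digital paper may draw its running text from PDFExtract and its tables from paperXML.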