Table of contents recognition for converting PDF documents in e-book formats

Authors:
Simone Marinai; Emanuele Marino; Giovanni Soda
Affiliations:
DSI - Università di Firenze, Firenze, Italy; DSI - Università di Firenze, Firenze, Italy; DSI - Università di Firenze, Firenze, Italy
Venue:
Proceedings of the 10th ACM symposium on Document engineering
Year:
2010

Abstract

We describe a tool for Table of Contents (ToC) identification and recognition in PDF books. This task is part of ongoing research on tools for the semi-automatic conversion of PDF documents into the EPUB format, which can be read on several e-book devices. Among the various sub-tasks, ToC extraction and recognition is particularly useful for easy navigation of book contents. The proposed tool first identifies the ToC pages. It then locates the bounding boxes of the corresponding titles in the book body in order to add suitable links to the EPUB ToC. The approach tolerates discrepancies between the ToC text and the corresponding titles. We evaluated the tool on several open-access books edited by university presses that are partners of the OAPEN EcontentPlus project.
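
The tolerant matching step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how discrepancies are handled, so the use of `difflib.SequenceMatcher`, the 0.8 threshold, the helper names (`clean_toc_entry`, `normalize`, `match_toc_entries`) and the sample data are all assumptions made for illustration. The sketch also assumes that every ToC entry ends with a page number and that the body text is already available as `(page, line)` pairs, and it returns the matching line rather than a bounding box.

```python
import re
from difflib import SequenceMatcher


def clean_toc_entry(entry):
    """Strip dot leaders and the trailing page number from a ToC entry.

    Assumption: every ToC entry ends with a page number.
    """
    return re.sub(r"[\s.\u2026]*\d+\s*$", "", entry).strip()


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def match_toc_entries(toc_entries, body_lines, threshold=0.8):
    """Map each ToC entry to its most similar body line.

    body_lines is a list of (page, line_text) pairs. An entry is left
    out of the result when no line reaches the (assumed) threshold.
    """
    matches = {}
    for entry in toc_entries:
        target = normalize(clean_toc_entry(entry))
        best_score, best = 0.0, None
        for page, line in body_lines:
            score = SequenceMatcher(None, target, normalize(line)).ratio()
            if score > best_score:
                best_score, best = score, (page, line)
        if best is not None and best_score >= threshold:
            matches[entry] = best
    return matches


# Hypothetical sample data, not taken from the evaluated books.
toc = [
    "Introduction ..... 7",
    "2 Methods and Materials . . . 23",
    "Appendix Z ... 99",
]
body = [
    (7, "INTRODUCTION"),
    (7, "This book describes ..."),
    (23, "2. Methods & Materials"),
    (24, "Results are shown in the table."),
]
matches = match_toc_entries(toc, body)
```

In this sketch the second entry still matches although the body title writes "&" where the ToC writes "and", while the third entry, which has no counterpart in the body, is left unmatched instead of being linked to an unrelated line. A real converter would also need the position of each matched title on the page in order to create the link target.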