Identification of Document Structure and Table of Content in Magazine Archives

Authors:
Sherif Yacoub; Jose Abad Peiro
Affiliations:
HP Labs, Spain; Hewlett-Packard, Spain
Venue:
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Year:
2005

Abstract

In this paper, we present a generic approach for reliably identifying the table of contents (TOC) pages in scanned documents. We use multiple sources of information to obtain a reliable assessment of the TOC pages and the position of articles. These sources are produced by three methods: title matching, section keyword matching, and numeric content. Finally, a combination component scores potential TOC pages and selects the best candidates. The system is used to identify the table of contents, locate the beginning of articles, aid advertisement identification (where present), and, more generally, identify the structure of scanned documents for article extraction and online deployment of digital content. We present results of applying the algorithms to an 80-year archive of the weekly magazine Time.
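The abstract names three evidence sources and a combination step without giving details. The following is a minimal sketch of how such a scheme could be wired together, not the authors' implementation. The keyword list, the weights, the weighted-sum combination, and all function names are assumptions made for illustration. It also assumes that OCR text is available per page and that a list of candidate article titles has been extracted elsewhere.

```python
import re

# Hypothetical section keywords; the abstract does not list the ones used.
SECTION_KEYWORDS = frozenset({"contents", "index", "departments", "features", "letters"})


def title_match_score(page_text, article_titles):
    """Fraction of candidate article titles that appear verbatim on the page."""
    if not article_titles:
        return 0.0
    text = page_text.lower()
    hits = sum(1 for title in article_titles if title.lower() in text)
    return hits / len(article_titles)


def keyword_score(page_text, keywords=SECTION_KEYWORDS):
    """Fraction of the section keywords that occur as whole words on the page."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return len(words & keywords) / len(keywords)


def numeric_score(page_text, page_count):
    """Fraction of non-empty lines ending in a plausible page number."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    hits = 0
    for ln in lines:
        match = re.search(r"(\d+)\s*$", ln)
        if match and 1 <= int(match.group(1)) <= page_count:
            hits += 1
    return hits / len(lines)


def combined_score(page_text, article_titles, page_count, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three scores; the weights are illustrative only."""
    scores = (
        title_match_score(page_text, article_titles),
        keyword_score(page_text),
        numeric_score(page_text, page_count),
    )
    return sum(w * s for w, s in zip(weights, scores))


def rank_toc_candidates(pages, article_titles, page_count, weights=(0.4, 0.3, 0.3)):
    """Return page indices ordered from most to least TOC-like."""
    scored = [
        (combined_score(text, article_titles, page_count, weights), index)
        for index, text in enumerate(pages)
    ]
    # Sort by descending score, breaking ties by earlier page index.
    return [index for _, index in sorted(scored, key=lambda item: (-item[0], item[1]))]


if __name__ == "__main__":
    pages = [
        "Dear reader, welcome to this issue.",
        "Contents\nThe Economy 12\nWorld News 20\nLetters 3",
        "The Economy\nGrowth slowed in 1950 as prices rose.",
    ]
    titles = ["The Economy", "World News"]
    print(rank_toc_candidates(pages, titles, page_count=40))
```

In this toy input, the second page ranks first because it contains both titles, two of the assumed keywords, and several lines ending in page numbers. A real system would need to tolerate OCR errors, for example by replacing the exact substring test with approximate string matching.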