Part-of-Speech Tagging for Table of Contents Recognition

Authors:
Affiliations:
Venue: ICPR '00 Proceedings of the International Conference on Pattern Recognition - Volume 4
Year: 2000


Abstract

A labeling approach to the automatic recognition of tables of contents (ToCs) is described. A prototype is used for the electronic consultation of scientific papers in a digital library system named Calliope. The method operates on a roughly structured ASCII file produced by OCR. Labeling is based on part-of-speech (POS) tagging. Tagging begins with a primary labeling of text components using specific dictionaries. Significant tags are then grouped into title and author strings and reduced to canonical forms according to contextual rules. Non-labeled tokens are assigned to one field or another either by applying contextual correction rules or by using a structure model generated from well-detected articles. The prototype performs satisfactorily on different ToC layouts and character recognition qualities. Without manual intervention, a correct segmentation rate of 95.41% and a correct field extraction rate of 81.74% were obtained on 38 journals comprising 2703 articles.
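The labeling pipeline summarized in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the dictionaries, the tag names (AUTHOR, TITLE, PAGE, UNK), the single "copy the neighbouring tag" correction rule, and the function names are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows the general shape of the three steps: dictionary-based primary labeling, contextual correction of unlabeled tokens, and grouping of tags into fields.

```python
import re

# Hypothetical toy dictionaries; the paper's actual dictionaries are not
# described in the abstract.
FIRST_NAMES = {"john", "marie", "paul"}
TITLE_WORDS = {"a", "analysis", "document", "for", "images", "method",
               "of", "recognition"}


def primary_label(token):
    """Assign an initial tag from the dictionaries, or 'UNK' if none applies."""
    low = token.lower().strip(".,")
    # Assumption: a first name or an initial such as "J." signals an author.
    if low in FIRST_NAMES or re.fullmatch(r"[A-Z]\.", token):
        return "AUTHOR"
    # Assumption: a purely numeric token is a page number.
    if token.isdigit():
        return "PAGE"
    if low in TITLE_WORDS:
        return "TITLE"
    return "UNK"


def resolve_unknowns(tags):
    """Illustrative contextual correction: an unlabeled token takes the tag
    of its left neighbour, or failing that, of its right neighbour."""
    resolved = list(tags)
    # Left-to-right pass: inherit from the token on the left.
    for i in range(1, len(resolved)):
        if resolved[i] == "UNK" and resolved[i - 1] not in ("UNK", "PAGE"):
            resolved[i] = resolved[i - 1]
    # Right-to-left pass: handle unknowns with no usable left neighbour.
    for i in range(len(resolved) - 2, -1, -1):
        if resolved[i] == "UNK" and resolved[i + 1] not in ("UNK", "PAGE"):
            resolved[i] = resolved[i + 1]
    return resolved


def group_fields(tokens, tags):
    """Merge consecutive tokens that share a tag into (tag, string) fields."""
    fields = []
    for token, tag in zip(tokens, tags):
        if fields and fields[-1][0] == tag:
            fields[-1] = (tag, fields[-1][1] + " " + token)
        else:
            fields.append((tag, token))
    return fields


def label_toc_line(line):
    """Run the three steps on one whitespace-tokenized ToC line."""
    tokens = line.split()
    tags = resolve_unknowns([primary_label(t) for t in tokens])
    return group_fields(tokens, tags)


if __name__ == "__main__":
    print(label_toc_line("Document Recognition of Invoices J. Smith 12"))
```

In this toy example, "Invoices" and "Smith" are in neither dictionary, so they start as UNK and inherit the tag of the token to their left. A real system would need much larger dictionaries and richer rules, and the structure model mentioned in the abstract is not modeled here.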