A labeling approach to the automatic recognition of tables of contents (ToCs) is described. A prototype is used for the electronic consultation of scientific papers in a digital library system named Calliope. The method operates on a roughly structured ASCII file produced by OCR. Labeling is based on part-of-speech (POS) tagging. Tagging begins with a primary labeling of text components using specific dictionaries. Significant tags are then grouped into title and author strings and reduced to canonical forms according to contextual rules. Unlabeled tokens are assigned to one field or the other, either by applying contextual correction rules or by using a structure model generated from well-detected articles. The prototype performs satisfactorily across different ToC layouts and character recognition qualities. Without manual intervention, it achieved a 95.41% rate of correct segmentation and an 81.74% rate of correct field extraction on 38 journals comprising 2703 articles.
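The labeling pipeline described above (dictionary-based primary labeling, contextual assignment of unlabeled tokens, grouping into title and author strings) can be illustrated with a minimal Python sketch. It is not the Calliope implementation: the dictionaries `FIRST_NAMES` and `TITLE_WORDS`, the tags `AU` and `TI`, the initial-letter pattern, and the "inherit from the nearest labeled neighbor" rule are all assumptions chosen for illustration, and the sketch omits canonical-form reduction and the structure model.

```python
import re
from itertools import groupby

# Hypothetical dictionaries; a real system would use much larger ones.
FIRST_NAMES = {"john", "marie", "paul"}
TITLE_WORDS = {"a", "the", "of", "for", "analysis", "recognition",
               "document", "tables"}


def primary_label(tokens):
    """Assign AU (author) or TI (title) from dictionaries, else None."""
    labels = []
    for tok in tokens:
        low = tok.lower().strip(".,")
        # Assumption: a single capital letter plus a period is an initial.
        if low in FIRST_NAMES or re.fullmatch(r"[A-Z]\.", tok):
            labels.append("AU")
        elif low in TITLE_WORDS:
            labels.append("TI")
        else:
            labels.append(None)
    return labels


def resolve_unlabeled(labels):
    """Illustrative contextual rule: an unlabeled token inherits the tag
    of its left neighbor; leading unlabeled tokens inherit from the right."""
    resolved = list(labels)
    for i in range(1, len(resolved)):
        if resolved[i] is None:
            resolved[i] = resolved[i - 1]
    for i in range(len(resolved) - 2, -1, -1):
        if resolved[i] is None:
            resolved[i] = resolved[i + 1]
    return resolved


def group_fields(tokens, labels):
    """Merge runs of identically tagged tokens into field strings."""
    fields = []
    for label, run in groupby(zip(tokens, labels), key=lambda p: p[1]):
        fields.append((label, " ".join(tok for tok, _ in run)))
    return fields


def label_toc_line(line):
    """Run the three steps on one whitespace-tokenized ToC line."""
    tokens = line.split()
    labels = resolve_unlabeled(primary_label(tokens))
    return group_fields(tokens, labels)


if __name__ == "__main__":
    print(label_toc_line("Robust Recognition of Tables J. Smith"))
```

In this sketch, "Robust" and "Smith" appear in neither dictionary, so they take the tag of their nearest labeled neighbor, which yields one title field and one author field for the sample line. A rule this simple would fail on many real layouts, which is presumably why the described method also relies on a structure model built from well-detected articles.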