Part-of-Speech Tagging for Table of Contents Recognition

  • Venue: ICPR '00 Proceedings of the International Conference on Pattern Recognition - Volume 4
  • Year: 2000

Abstract

A labeling approach to the automatic recognition of tables of contents (ToCs) is described. A prototype is used to consult scientific papers electronically in a digital library system named Calliope. The method operates on a roughly structured ASCII file produced by OCR. Labeling is based on part-of-speech (POS) tagging. Tagging starts with a primary labeling of text components using specific dictionaries. Significant tags are then grouped into title and author strings and reduced to canonical forms according to contextual rules. Unlabeled tokens are integrated into one field or another, either by applying contextual correction rules or by using a structure model generated from well-detected articles. The prototype performs satisfactorily on different ToC layouts and character recognition qualities. Without manual intervention, a correct segmentation rate of 95.41% and a correct field extraction rate of 81.74% were obtained on 38 journals comprising 2703 articles.
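
The labeling pipeline outlined in the abstract (dictionary-based primary labeling, absorption of unlabeled tokens by a contextual rule, grouping of tags into fields) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the dictionaries `FIRST_NAMES` and `TITLE_WORDS`, the tag names, the "inherit the neighbouring tag" rule, and all function names are assumptions made for illustration only. The reduction to canonical forms and the structure model learned from well-detected articles are not covered.

```python
import re
from itertools import groupby

# Hypothetical toy dictionaries; the abstract does not list the real ones.
FIRST_NAMES = {"john", "marie", "pierre", "anna"}
TITLE_WORDS = {"analysis", "recognition", "of", "for", "the", "a",
               "document", "tagging", "method"}


def primary_label(token):
    """Assign a coarse tag to one token using pattern and dictionary lookups."""
    word = token.strip(".,;:").lower()
    if re.fullmatch(r"\d+", word):
        return "PAGE"                      # bare number: assumed page number
    if re.fullmatch(r"[a-z]", word) and token.endswith("."):
        return "AUTHOR"                    # initial such as "J."
    if word in FIRST_NAMES:
        return "AUTHOR"
    if word in TITLE_WORDS:
        return "TITLE"
    return "UNK"                           # left for contextual correction


def resolve_unknowns(tags):
    """Assumed contextual rule: an unknown token inherits the tag of the
    preceding title/author token, else the next labeled one."""
    resolved = list(tags)
    for i, tag in enumerate(resolved):
        if tag != "UNK":
            continue
        if i > 0 and resolved[i - 1] in ("TITLE", "AUTHOR"):
            resolved[i] = resolved[i - 1]
        else:
            following = next((t for t in tags[i + 1:] if t != "UNK"), "TITLE")
            resolved[i] = following if following != "PAGE" else "TITLE"
    return resolved


def label_line(line):
    """Tag one ToC line and group runs of identical tags into fields."""
    tokens = line.split()
    tags = resolve_unknowns([primary_label(t) for t in tokens])
    return [(tag, " ".join(tok for tok, _ in run))
            for tag, run in groupby(zip(tokens, tags), key=lambda p: p[1])]


if __name__ == "__main__":
    for field in label_line("Document Recognition for Archives J. Dupont 12"):
        print(field)
```

In this sketch, "Archives" and "Dupont" are absent from both toy dictionaries, so each takes the tag of the token before it; a real system would need far richer dictionaries and rules to handle OCR noise and varied layouts.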