A Segmentation Method for Bibliographic References by Contextual Tagging of Fields

Authors:
Dominique Besagni; Abdel Belaïd; Nelly Benet
Venue:
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
Year:
2003

Citing 3

Visualizing science by citation mapping
Journal of the American Society for Information Science

Digital Libraries and Autonomous Citation Indexing
Computer

A simple rule-based part of speech tagger
ANLC '92 Proceedings of the third conference on Applied natural language processing

Cited 5

CEBBIP: a parser of bibliographic information in Chinese electronic books
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries

Multi-page document analysis based on format consistency and clustering
International Journal of Computer Applications in Technology

Evidence-based information extraction for high accuracy citation and author name identification
Large Scale Semantic Access to Content (Text, Image, Video, and Sound)

Web-based citation parsing, correction and augmentation
Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries

Cost effective ontology population with data from lists in OCRed historical documents
Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing

Quantified Score

Hi-index	0.01

Abstract

In this paper, a method based on part-of-speech (PoS) tagging is used to structure bibliographic references. The method operates on a roughly structured ASCII file produced by OCR. Because reference structure is heterogeneous, the method works bottom-up, without an a priori model, assembling structural elements from basic tags into sub-fields and then fields. Significant tags are first grouped into homogeneous classes according to their grammatical categories and then reduced to canonical forms corresponding to record fields: "authors", "title", "conference name", "date", etc. Unlabelled tokens are assigned to a field either by applying PoS correction rules or by using a structure model generated from well-detected records. The prototype performs satisfactorily across different record layouts and character recognition qualities. Without manual intervention, 96.6% of words are correctly attributed and about 75.9% of references are completely segmented, on a set of 2500 references.
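
The bottom-up idea in the abstract (basic tags first, then fields, then absorption of unlabelled tokens) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tag names (`YEAR`, `INITIAL`, `CAP`, `WORD`, `PUNCT`), the three rules, and the function names are hypothetical and are not taken from the paper, whose actual tag set, correction rules, and structure model are not described in the abstract.

```python
import re


def tokenize(reference):
    """Split a reference into initials, (hyphenated) words, and punctuation."""
    return re.findall(r"[A-Z]\.|\w+(?:-\w+)*|[^\w\s]", reference)


def basic_tag(token):
    """Assign a coarse tag to one token (hypothetical tag set)."""
    if re.fullmatch(r"(19|20)\d{2}", token):
        return "YEAR"
    if re.fullmatch(r"[A-Z]\.", token):
        return "INITIAL"
    if re.fullmatch(r"[^\w\s]", token):
        return "PUNCT"
    if token[0].isupper():
        return "CAP"
    return "WORD"


def label_fields(reference):
    """Group tokens into fields, working bottom-up from the basic tags."""
    tokens = tokenize(reference)
    tags = [basic_tag(t) for t in tokens]
    labels = [None] * len(tokens)

    # Step 1: significant tags map directly to a field.
    for i, tag in enumerate(tags):
        if tag == "YEAR":
            labels[i] = "date"
        elif tag == "INITIAL":
            labels[i] = "authors"

    # Step 2: contextual rule (assumed): capitalized words and punctuation
    # immediately before an initial are pulled into the author field.
    for i, tag in enumerate(tags):
        if tag == "INITIAL":
            j = i - 1
            while j >= 0 and labels[j] is None and tags[j] in ("CAP", "PUNCT"):
                labels[j] = "authors"
                j -= 1

    # Step 3: remaining non-punctuation tokens default to the title field.
    fields = {}
    for token, tag, label in zip(tokens, tags, labels):
        if tag == "PUNCT":
            continue
        fields.setdefault(label or "title", []).append(token)
    return {name: " ".join(words) for name, words in fields.items()}


if __name__ == "__main__":
    ref = "Brill, E. A simple rule-based part of speech tagger. 1992."
    print(label_fields(ref))
```

The sketch handles only one simple layout (authors, then title, then year). A real system would need many more tags and rules, and the default in step 3 stands in for what the abstract describes as a structure model learned from well-detected records.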