Citation Recognition for Scientific Publications in Digital Libraries

  • Authors:
  • Dominique Besagni; Abdel Belaïd

  • Affiliations:
  • -;-

  • Venue:
  • DIAL '04 Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04)
  • Year:
  • 2004

Abstract

In this paper, a method based on part-of-speech (PoS) tagging is used to structure bibliographic references. The method operates on a roughly structured ASCII file produced by OCR. Because reference structure is heterogeneous, the method works bottom-up, without an a priori model, gathering structural elements from basic tags into sub-fields and fields. Significant tags are first grouped into homogeneous classes according to their categories and then reduced to canonical forms corresponding to record fields: "authors", "title", "conference name", "date", etc. Unlabeled tokens are integrated into one field or another either by applying PoS correction rules or by using an inter- or intra-field model generated from well-detected records. The prototype performs satisfactorily across different record layouts and character recognition qualities. Without manual intervention, 96.6% of words are correctly attributed, and about 75.9% of the 2,575 references are completely segmented.
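
The bottom-up idea in the abstract (basic tags first, then homogeneous groups) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tag names (YEAR, INITIAL, PUNCT, CAPWORD, WORD), the regular expressions, and the helper functions `tokenize`, `tag_token`, and `group_runs` are all assumptions made for illustration, since the abstract does not list the actual tag set or rules.

```python
import re
from itertools import groupby


def tokenize(reference):
    """Split a reference string into initials, words, and punctuation marks."""
    # Hypothetical tokenizer: "D." is kept together as an author initial.
    return re.findall(r"[A-Z]\.|\w+|[^\w\s]", reference)


def tag_token(token):
    """Assign a basic tag to one token (illustrative tag set, not the paper's)."""
    if re.fullmatch(r"(19|20)\d{2}", token):
        return "YEAR"
    if re.fullmatch(r"[A-Z]\.", token):
        return "INITIAL"
    if re.fullmatch(r"[^\w\s]", token):
        return "PUNCT"
    if token[0].isupper():
        return "CAPWORD"
    return "WORD"


def group_runs(tagged):
    """Merge consecutive tokens that share a tag into (tag, [tokens]) runs."""
    return [
        (tag, [tok for tok, _ in run])
        for tag, run in groupby(tagged, key=lambda pair: pair[1])
    ]


if __name__ == "__main__":
    ref = "Besagni, D. and Belaid, A. Citation recognition. 2004."
    tagged = [(tok, tag_token(tok)) for tok in tokenize(ref)]
    for tag, tokens in group_runs(tagged):
        print(tag, tokens)
```

A later stage, not shown here, would reduce such runs to record fields ("authors", "date", and so on) and attach the remaining unlabeled tokens to a neighboring field, as the abstract describes.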