Citation Recognition for Scientific Publications in Digital Libraries

Authors:
Dominique Besagni;Abdel Belaïd
Venue:
DIAL '04 Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04)
Year:
2004

Abstract

In this paper, a method based on part-of-speech (PoS) tagging is used to structure bibliographic references. The method operates on a roughly structured ASCII file produced by OCR. Because reference structures are heterogeneous, the method works bottom-up, without an a priori model, gathering structural elements from basic tags to sub-fields and fields. Significant tags are first grouped into homogeneous classes according to their categories and then reduced to canonical forms corresponding to record fields: "authors", "title", "conference name", "date", etc. Unlabeled tokens are integrated into one field or another either by applying PoS correction rules or by using an inter- or intra-field model generated from well-detected records. The prototype performs satisfactorily across different record layouts and character recognition qualities. Without manual intervention, 96.6% of words are correctly attributed, and about 75.9% of the 2,575 references are completely segmented.
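
The bottom-up idea in the abstract (tag tokens, map significant tags to fields, then absorb unlabeled tokens) can be illustrated with a toy sketch. This is not the authors' implementation: the tag names, the regular expressions, the single correction rule, and the "everything else is title" fallback are all assumptions made for illustration, since the abstract does not specify the actual tag set or rules.

```python
import re

# Hypothetical mapping from significant tags to record fields.
TAG_TO_FIELD = {"INITIAL": "authors", "YEAR": "date"}


def tokenize(reference):
    """Split a reference into initials ("J."), words/numbers, and punctuation."""
    return re.findall(r"\b[A-Z]\.|\w+|[^\w\s]", reference)


def tag_token(token):
    """Assign a coarse tag from surface patterns (illustrative tag set)."""
    if re.fullmatch(r"(19|20)\d{2}", token):
        return "YEAR"
    if re.fullmatch(r"[A-Z]\.", token):
        return "INITIAL"
    if re.fullmatch(r"[A-Z][a-z]+", token):
        return "CAPWORD"
    if re.fullmatch(r"[^\w\s]", token):
        return "PUNCT"
    return "UNK"


def label_reference(reference):
    """Group tokens into fields bottom-up, starting from token tags."""
    tokens = tokenize(reference)
    tags = [tag_token(t) for t in tokens]

    # Step 1: significant tags seed their fields directly.
    fields = [TAG_TO_FIELD.get(tag) for tag in tags]

    # Step 2 (assumed correction rule): a capitalised word that
    # immediately follows an initial is treated as a surname.
    for i in range(1, len(tokens)):
        if tags[i] == "CAPWORD" and tags[i - 1] == "INITIAL":
            fields[i] = "authors"

    # Step 3 (assumed fallback): remaining non-punctuation tokens
    # are attributed to the title field.
    for i, tag in enumerate(tags):
        if fields[i] is None and tag != "PUNCT":
            fields[i] = "title"

    # Collect the tokens of each field in reading order.
    record = {}
    for token, field in zip(tokens, fields):
        if field is not None:
            record.setdefault(field, []).append(token)
    return {field: " ".join(toks) for field, toks in record.items()}


if __name__ == "__main__":
    print(label_reference("J. Smith, A. Jones. Parsing citations with tags. 2004."))
```

The example reference is invented. A real system along the lines of the abstract would presumably replace the fixed fallback in step 3 with inter- and intra-field models learned from well-detected records.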