Mining Publication Records on Personal Publication Web Pages Based on Conditional Random Fields

Authors:
Jen-Ming Chung;Ya-Huei Lin;Hahn-Ming Lee;Jan-Ming Ho
Affiliations:
-;-;-;-
Venue:
WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Year:
2012


Abstract

A publication record is a list of semi-structured citation strings for the publications of a research institute or an individual researcher. Publication records are integrated into digital libraries, where they form an important knowledge base that in turn enables a variety of applications. A publication record is usually found among other information on a publication Web page (or publication page for short), so extracting publication records from these Web pages is an interesting problem. The problem is difficult for several reasons, including the flexibility with which the metadata of a publication can be formatted into a semi-structured citation string and with which that citation string can be rendered visually in HTML. Furthermore, two citation strings with similar visual presentation on the same Web page may use different HTML constructs. In this paper, we present a content analysis approach based on Conditional Random Fields and data region boundary analysis to automatically extract citation records from a publication page. Experimental results show that our method performs well on a benchmark containing manually crafted publication Web pages. The precision, recall, and F-measure are 82.5%, 87.6%, and 85.0%, respectively. This is an improvement over previous results.
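
The three reported figures can be checked against one another. The sketch below assumes the F-measure is the balanced F1 score, that is, the harmonic mean of precision and recall; the abstract does not state the weighting, so this is an assumption, and the function name f_measure is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's F-measure weights precision and recall
    equally; the source does not state the weighting explicitly.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract, expressed as fractions.
precision = 0.825
recall = 0.876

f1 = f_measure(precision, recall)
print(f"F-measure: {f1 * 100:.1f}%")  # prints "F-measure: 85.0%"
```

Under that assumption, the computed value rounds to the 85.0% given in the abstract, so the three numbers are mutually consistent.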