Mining Publication Records on Personal Publication Web Pages Based on Conditional Random Fields

  • Authors:
  • Jen-Ming Chung;Ya-Huei Lin;Hahn-Ming Lee;Jan-Ming Ho

  • Venue:
  • WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
  • Year:
  • 2012

Abstract

A publication record denotes a list of semi-structured citation strings for the publications of a research institute or an individual researcher. Publication records are integrated into digital libraries, forming an important knowledge base that in turn enables a variety of applications. A publication record is usually found among other information on a publication Web page (publication page for short). Extracting publication records from these Web pages is therefore an interesting problem. The problem is difficult for several reasons, including the flexibility in formatting the metadata of a publication into a semi-structured citation string and in rendering that citation string visually in HTML. Furthermore, two citation strings with similar visual presentation on the same Web page may have different HTML constructs. In this paper, we present a content analysis approach based on Conditional Random Fields and data region boundary analysis to automatically extract citation records from a publication page. Experimental results show that our method performs well on a benchmark containing manually crafted publication Web pages. The precision, recall, and F-measure are 82.5%, 87.6%, and 85.0%, respectively. This is an improvement over previous results.
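
As a quick consistency check on the reported figures, the sketch below recomputes the F-measure from the stated precision and recall. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the abstract does not say which variant was used, and the function name `f_measure` is only illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is the balanced F1 variant.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
precision, recall = 82.5, 87.6
f1 = f_measure(precision, recall)
print(f"F-measure: {f1:.1f}%")
```

Under that assumption, the result rounds to the 85.0% given in the abstract, so the three reported numbers are mutually consistent.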