A publication record is a list of semi-structured citation strings for the publications of a research institute or an individual researcher. Once integrated into a digital library, publication records become an important knowledge base that in turn enables a variety of applications. A publication record is usually found among other information on a publication Web page (or publication page for short), so extracting publication records from these Web pages is an interesting problem. The problem is difficult for several reasons, including the flexibility with which the metadata of a publication can be formatted into a semi-structured citation string and with which that string can be rendered visually in HTML. Furthermore, two citation strings with similar visual presentation on the same Web page may use different HTML constructs. In this paper, we present a content analysis approach based on Conditional Random Fields and data region boundary analysis to automatically extract citation records from a publication page. Experimental results show that our method performs well on a benchmark containing manually crafted publication Web pages: the precision, recall, and F-measure are 82.5%, 87.6%, and 85.0%, respectively, an improvement over previous results.
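The three reported scores can be checked against each other with a minimal sketch, assuming the F-measure is the balanced F1 score (the harmonic mean of precision and recall). The abstract does not state which F-measure variant was used, so that is an assumption, and the function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's F-measure is the balanced F1 variant.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
score = f_measure(0.825, 0.876)
print(f"{score:.1%}")
```

Under that assumption, the computed value rounds to the 85.0% reported above, so the three figures are mutually consistent.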