Metadata extraction from bibliographies using bigram HMM

  • Authors:
  • Ping Yin; Ming Zhang; ZhiHong Deng; DongQing Yang

  • Affiliations:
  • School of Electronics Engineering and Computer Science, Peking University, Beijing, China (all four authors)

  • Venue:
  • ICADL'04: Proceedings of the 7th International Conference on Digital Libraries: International Collaboration and Cross-Fertilization
  • Year:
  • 2004

Abstract

In recent years, huge volumes of research papers have become available on the World Wide Web. Metadata provides a good way to organize and retrieve these resources. Accordingly, automatic extraction of metadata from these papers and their bibliographies is valuable and has been widely studied. In this paper, we use a bigram HMM (Hidden Markov Model) to automatically extract metadata (e.g., title, author, date, journal, and pages) from bibliographies in various styles. Unlike the traditional HMM, which uses only word frequency, this model also considers both the bigram sequential relation between words and their position information in text fields. We evaluated the model on a real corpus downloaded from the Web and compared it with other methods. Experiments show that the bigram HMM yields the best results and seems to be the most promising candidate for metadata extraction from bibliographies.
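
The abstract does not give the model's equations, so the following is only a minimal Python sketch of one way a bigram-aware HMM tagger could work, not the authors' implementation. It assumes that the emission probability interpolates a per-field unigram estimate with a per-field bigram estimate conditioned on the previous word, that add-one smoothing is used, and that decoding is done with the Viterbi algorithm. The position information mentioned in the abstract is omitted. The names train, viterbi, START, TOY_REFS and the weight lam are hypothetical, as is the toy training data.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start marker, used for both fields and words


def train(tagged_refs):
    """Count field transitions, unigram emissions and bigram emissions."""
    trans = defaultdict(Counter)  # previous field -> next field counts
    uni = defaultdict(Counter)    # field -> word counts
    bi = defaultdict(Counter)     # (field, previous word) -> word counts
    vocab = set()
    for ref in tagged_refs:
        prev_field, prev_word = START, START
        for word, field in ref:
            trans[prev_field][field] += 1
            uni[field][word] += 1
            bi[(field, prev_word)][word] += 1
            vocab.add(word)
            prev_field, prev_word = field, word
    return trans, uni, bi, vocab, sorted(uni)


def _prob(counts, key, n_outcomes):
    """Add-one smoothed relative frequency (an assumption of this sketch)."""
    return (counts.get(key, 0) + 1.0) / (sum(counts.values()) + n_outcomes)


def viterbi(words, model, lam=0.5):
    """Return the most likely field label for each word."""
    if not words:
        return []
    trans, uni, bi, vocab, states = model
    n_words = len(vocab) + 1  # one extra slot for unseen words

    def emit(state, prev_word, word):
        p_uni = _prob(uni.get(state, {}), word, n_words)
        p_bi = _prob(bi.get((state, prev_word), {}), word, n_words)
        # Interpolate bigram and unigram evidence with weight lam.
        return math.log(lam * p_bi + (1.0 - lam) * p_uni)

    def step(prev_state, state):
        return math.log(_prob(trans.get(prev_state, {}), state, len(states)))

    scores = {s: step(START, s) + emit(s, START, words[0]) for s in states}
    back = []
    for t in range(1, len(words)):
        new_scores, pointers = {}, {}
        for s in states:
            best = max(states, key=lambda p: scores[p] + step(p, s))
            pointers[s] = best
            new_scores[s] = (scores[best] + step(best, s)
                             + emit(s, words[t - 1], words[t]))
        scores = new_scores
        back.append(pointers)

    # Follow the back-pointers from the best final state.
    path = [max(states, key=scores.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy, hand-labelled references (invented for illustration only).
TOY_REFS = [
    [("Smith", "author"), ("J", "author"), ("Parsing", "title"),
     ("citations", "title"), ("2004", "date")],
    [("Lee", "author"), ("K", "author"), ("Tagging", "title"),
     ("references", "title"), ("2003", "date")],
]

model = train(TOY_REFS)
labels = viterbi(["Smith", "J", "Parsing", "citations", "2004"], model)
print(labels)
```

Setting lam to 0 reduces the emission term to the word-frequency-only HMM that the abstract contrasts against, so the same sketch can be used to compare the two variants on labelled data.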