Metadata extraction from bibliographies using bigram HMM

Authors:
Ping Yin; Ming Zhang; ZhiHong Deng; DongQing Yang
Affiliations:
School of Electronics Engineering and Computer Science, Peking University, Beijing, China (all authors)
Venue:
ICADL'04: Proceedings of the 7th International Conference on Digital Libraries: International Collaboration and Cross-Fertilization
Year:
2004

Abstract

In recent years, huge volumes of research papers have become available on the World Wide Web. Metadata provides a good way to organize and retrieve these resources. Accordingly, automatic extraction of metadata from these papers and their bibliographies is valuable and has been widely studied. In this paper, we use a bigram HMM (Hidden Markov Model) to automatically extract metadata (e.g., title, author, date, journal, pages) from bibliographies in various styles. Unlike the traditional HMM, which uses only word frequency, this model also considers the bigram sequential relation between words and their position within text fields. We evaluated the model on a real corpus downloaded from the Web and compared it with other methods. Experiments show that the bigram HMM yields the best results and seems to be the most promising candidate for metadata extraction from bibliographies.
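
The abstract does not give the model's equations, so the following is only a minimal sketch of one plausible reading of a bigram HMM for reference tagging, not the authors' formulation. Assumptions made here for illustration: the first token of a field is scored with a unigram emission, later tokens in the same field interpolate a bigram emission (conditioned on the previous token) with the unigram one, add-alpha smoothing is used throughout, and decoding is plain Viterbi. The names `train`, `smoothed`, `viterbi`, `alpha`, and `lam` are hypothetical.

```python
import math
from collections import defaultdict


def train(tagged_refs):
    """Count transitions and emissions from references given as (token, field) lists."""
    trans = defaultdict(lambda: defaultdict(float))  # field -> next field
    uni = defaultdict(lambda: defaultdict(float))    # field -> token
    bi = defaultdict(lambda: defaultdict(float))     # (field, previous token) -> token
    vocab = set()
    for ref in tagged_refs:
        prev_field, prev_tok = "<s>", None
        for tok, field in ref:
            trans[prev_field][field] += 1
            uni[field][tok] += 1
            if prev_field == field:  # bigram counted only inside a field
                bi[(field, prev_tok)][tok] += 1
            vocab.add(tok)
            prev_field, prev_tok = field, tok
    return trans, uni, bi, vocab


def smoothed(counts, key, n_outcomes, alpha):
    """Add-alpha estimate of P(key) from a count table."""
    total = sum(counts.values())
    return (counts.get(key, 0.0) + alpha) / (total + alpha * n_outcomes)


def viterbi(tokens, model, alpha=0.1, lam=0.7):
    """Return the most likely field label for each token."""
    trans, uni, bi, vocab = model
    fields = sorted(uni)
    n_vocab = len(vocab) + 1  # one extra slot for unseen tokens

    def emit(field, tok, prev_tok, same_field):
        p_uni = smoothed(uni.get(field, {}), tok, n_vocab, alpha)
        if not same_field:  # first token of a field: unigram only
            return p_uni
        p_bi = smoothed(bi.get((field, prev_tok), {}), tok, n_vocab, alpha)
        return lam * p_bi + (1 - lam) * p_uni

    def step(prev_field, field):
        return smoothed(trans.get(prev_field, {}), field, len(fields), alpha)

    score = {f: math.log(step("<s>", f)) + math.log(emit(f, tokens[0], None, False))
             for f in fields}
    backpointers = []
    for t in range(1, len(tokens)):
        new_score, ptr = {}, {}
        for f in fields:
            best_prev, best = None, -math.inf
            for g in fields:
                s = (score[g] + math.log(step(g, f))
                     + math.log(emit(f, tokens[t], tokens[t - 1], g == f)))
                if s > best:
                    best_prev, best = g, s
            new_score[f], ptr[f] = best, best_prev
        backpointers.append(ptr)
        score = new_score

    path = [max(score, key=score.get)]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy training data (invented for illustration, not from the paper's corpus).
toy_refs = [
    [("smith", "author"), ("j", "author"), ("hidden", "title"),
     ("markov", "title"), ("models", "title"), ("2004", "date")],
    [("lee", "author"), ("k", "author"), ("markov", "title"),
     ("chains", "title"), ("2001", "date")],
]
toy_model = train(toy_refs)
labels = viterbi(["lee", "k", "hidden", "markov", "models", "2004"], toy_model)
print(labels)
```

In this sketch the interpolation weight `lam` controls how much the within-field word bigram counts against the plain word frequency; setting `lam=0` reduces it to an ordinary unigram-emission HMM, which is the baseline the abstract contrasts against.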