A trigram hidden Markov model for metadata extraction from heterogeneous references

Authors:
Bolanle Ojokoh;Ming Zhang;Jian Tang
Affiliations:
School of Electronic Engineering and Computer Science, Peking University, Beijing 100871, PR China and Department of Computer Science, Federal University of Technology, P.M.B. 704 Akure, Nigeria;School of Electronic Engineering and Computer Science, Peking University, Beijing 100871, PR China;School of Electronic Engineering and Computer Science, Peking University, Beijing 100871, PR China
Venue:
Information Sciences: an International Journal
Year:
2011

Abstract

Our objective was to explore efficient and accurate extraction of metadata, such as author, title and institution, from heterogeneous references using hidden Markov models (HMMs). The major contributions of the research were (i) the development of a trigram, full second-order hidden Markov model that gives more priority to words emitted in transitions to the same state, with a corresponding new Viterbi algorithm; (ii) the introduction of a new smoothing technique for transition probabilities; and (iii) a proposed modification of the back-off shrinkage technique for emission probabilities. The effect of data set size on the training procedure was also measured. The model was evaluated on three different data sets and compared with related work. The results showed overall accuracy, precision, recall and F1 measure of over 95%, suggesting that the method outperforms other related methods in the task of metadata extraction from references.
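
To make the idea of a trigram (second-order) HMM concrete, the following is a minimal sketch of textbook second-order Viterbi decoding, in which each state depends on the two preceding states. It is not the modified Viterbi algorithm, the smoothing technique, or the shrinkage method proposed in the paper, none of which are specified in this abstract. The function name, the dictionary layout of the probability tables, the two field labels, and every probability value in the toy model are assumptions made purely for illustration.

```python
import math
from itertools import product

NEG_INF = float("-inf")


def safe_log(p):
    """Log of a probability, mapping zero to negative infinity."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi_second_order(obs, states, start, trans1, trans2, emit):
    """Most likely state sequence under a second-order HMM (illustrative).

    start[s]            = P(s) for the first token
    trans1[(u, v)]      = P(v | u), used only for the second token
    trans2[(u, v, w)]   = P(w | u, v), the trigram transition
    emit[(s, token)]    = P(token | s)
    Missing table entries are treated as probability zero.
    """
    n = len(obs)
    if n == 0:
        return []
    if n == 1:
        return [max(states, key=lambda s: safe_log(start.get(s, 0.0))
                    + safe_log(emit.get((s, obs[0]), 0.0)))]

    # delta[(u, v)]: best log score of any path ending in states u, v.
    delta = {}
    for u, v in product(states, repeat=2):
        delta[(u, v)] = (safe_log(start.get(u, 0.0))
                         + safe_log(emit.get((u, obs[0]), 0.0))
                         + safe_log(trans1.get((u, v), 0.0))
                         + safe_log(emit.get((v, obs[1]), 0.0)))

    back = []  # one dict per position t >= 2: (v, w) -> best predecessor u
    for t in range(2, n):
        new_delta, pointers = {}, {}
        for v, w in product(states, repeat=2):
            best_u = max(states, key=lambda u: delta[(u, v)]
                         + safe_log(trans2.get((u, v, w), 0.0)))
            new_delta[(v, w)] = (delta[(best_u, v)]
                                 + safe_log(trans2.get((best_u, v, w), 0.0))
                                 + safe_log(emit.get((w, obs[t]), 0.0)))
            pointers[(v, w)] = best_u
        delta = new_delta
        back.append(pointers)

    # Backtrack from the best final state pair.
    v, w = max(delta, key=delta.get)
    path = [v, w]
    for pointers in reversed(back):
        path.insert(0, pointers[(path[0], path[1])])
    return path


# Toy model with made-up numbers: two fields, four tokens.
states = ["author", "title"]
start = {"author": 0.9, "title": 0.1}
trans1 = {("author", "author"): 0.7, ("author", "title"): 0.3,
          ("title", "author"): 0.1, ("title", "title"): 0.9}
# Hypothetical trigram table that favours staying in the current state.
trans2 = {(u, v, w): (0.8 if w == v else 0.2)
          for u, v, w in product(states, repeat=3)}
emit = {("author", "smith"): 0.5, ("author", "jones"): 0.5,
        ("title", "hidden"): 0.5, ("title", "markov"): 0.5}

labels = viterbi_second_order(["smith", "jones", "hidden", "markov"],
                              states, start, trans1, trans2, emit)
```

In this toy model each token can be emitted by only one state, so `labels` is the author, author, title, title sequence. With realistic references, unseen tokens would have zero emission probability in every state, which is why smoothing of the transition and emission tables, as the abstract describes, matters in practice.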