Semi-supervised bibliographic element segmentation with latent permutations

Authors:
Tomonari Masada;Atsuhiro Takasu;Yuichiro Shibata;Kiyoshi Oguri
Affiliations:
Nagasaki University, Nagasaki-shi, Nagasaki, Japan;National Institute of Informatics, Chiyoda-ku, Tokyo, Japan;Nagasaki University, Nagasaki-shi, Nagasaki, Japan;Nagasaki University, Nagasaki-shi, Nagasaki, Japan
Venue:
ICADL'11 Proceedings of the 13th international conference on Asia-pacific digital libraries: for cultural heritage, knowledge dissemination, and future creation
Year:
2011

Citing 8
Cited 0

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning

Bibliographic attribute extraction from erroneous references based on a statistical model
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries

Latent dirichlet allocation
The Journal of Machine Learning Research

Bibliographic Meta-Data Extraction Using Probabilistic Finite State Transducers
ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02

A simple method for citation metadata extraction using hidden markov models
Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries

Global models of document structure using latent permutations
NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Metadata extraction from bibliographies using bigram HMM
ICADL'04 Proceedings of the 7th international Conference on Digital Libraries: international collaboration and cross-fertilization

Unsupervised Segmentation of Bibliographic Elements with Latent Permutations
International Journal of Organizational and Collective Intelligence

Abstract

This paper proposes a semi-supervised method for bibliographic element segmentation. Our input data is a large-scale set of bibliographic references, each given as an unsegmented sequence of word tokens. Our problem is to segment each reference into bibliographic elements, e.g., authors, title, journal, and pages. We solve this problem with an LDA-like topic model that assigns each word token to a topic, so that the word tokens assigned to the same topic refer to the same bibliographic element. Topic assignments should satisfy a contiguity constraint, i.e., the word tokens assigned to the same topic should be contiguous. To meet this constraint, we proposed a topic model in our preceding work [8], based on the topic model devised by Chen et al. [3]. Our model extends LDA and realizes unsupervised topic assignments that satisfy the contiguity constraint. The main contribution of this paper is a semi-supervised learning method for that model. We assume that at most one third of the word tokens are already labeled. In addition, we assume that a few percent of the labels may be incorrect. Our experiment showed that the semi-supervised learning improved on the unsupervised learning by a large margin and achieved a segmentation accuracy of over 90%.
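
The contiguity constraint described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' model or inference procedure: the function names `is_contiguous` and `assignments_from_permutation` are hypothetical, and building assignments from a topic ordering plus per-topic token counts is only an assumed simplification of the "latent permutations" idea.

```python
from itertools import groupby
from typing import List, Sequence


def is_contiguous(topics: Sequence[int]) -> bool:
    """Return True if each topic occupies a single unbroken run of positions."""
    # Collapse consecutive repeats; a topic appearing in two runs breaks contiguity.
    runs = [topic for topic, _ in groupby(topics)]
    return len(runs) == len(set(runs))


def assignments_from_permutation(order: Sequence[int],
                                 counts: Sequence[int]) -> List[int]:
    """Lay out topics in the given order (hypothetical helper).

    Topic k is repeated counts[k] times, so the result is contiguous
    by construction. This is an assumed simplification for illustration.
    """
    return [topic for topic in order for _ in range(counts[topic])]


# Example: topics 0, 1, 2 might stand for authors, title, and journal.
reference_topics = assignments_from_permutation(order=[2, 0, 1], counts=[2, 1, 3])
print(reference_topics, is_contiguous(reference_topics))
print(is_contiguous([0, 1, 0]))  # topic 0 is split into two runs
```

Under this simplification, any ordering of the topics yields a valid segmentation, which is why a permutation is a convenient latent variable: it rules out assignments such as `[0, 1, 0]` without checking them one by one.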