A comparison of layout based bibliographic metadata extraction techniques

Authors:
Michael Granitzer;Maya Hristakeva;Robert Knight;Kris Jack;Roman Kern
Affiliations:
University of Passau, Passau, Germany;Mendeley Ltd., London, UK;Mendeley Ltd., London, UK;Mendeley Ltd., London, UK;Graz University of Technology, Graz, Austria
Venue:
Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
Year:
2012

Citing 5
Cited 0

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Automatic document metadata extraction using support vector machines

Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
Rule-based word clustering for document metadata extraction

Proceedings of the 2005 ACM symposium on Applied computing
Open information extraction using Wikipedia

ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
A comparison of metadata extraction techniques for crowdsourced bibliographic metadata management

Proceedings of the 27th Annual ACM Symposium on Applied Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Social research networks such as Mendeley and CiteULike offer various services for collaboratively managing bibliographic metadata. Compared with traditional libraries, metadata quality is of crucial importance in order to create a crowdsourced bibliographic catalog for search and browsing. Artifacts, in particular PDFs which are managed by the users of the social research networks, become one important metadata source and the starting point for creating a homogeneous, high quality, bibliographic catalog. Natural Language Processing and Information Extraction techniques have been employed to extract structured information from unstructured sources. However, given highly heterogeneous artifacts that cover a range of publication styles, stemming from different publication sources, and imperfect PDF processing tools, how accurate are metadata extraction methods in such real-world settings? This paper focuses on answering that question by investigating the use of Conditional Random Fields and Support Vector Machines on real-world data gathered from Mendeley and Linked-Data repositories. We compare style and content features on existing state-of-the-art methods on two newly created real-world data sets for metadata extraction. Our analysis shows that two-stage SVMs provide reasonable performance in solving the challenge of metadata extraction for crowdsourcing bibliographic metadata management.