Automatic metadata extraction from museum specimen labels

Authors:
P. Bryan Heidorn; Qin Wei
Affiliations:
University of Illinois, Champaign, IL; University of Illinois, Champaign, IL
Venue:
DCMI '08 Proceedings of the 2008 International Conference on Dublin Core and Metadata Applications
Year:
2008

Abstract

This paper describes the information properties of museum specimen labels and the machine learning tools used to automatically extract Darwin Core (DwC) and other metadata from these labels after they have been processed through Optical Character Recognition (OCR). The DwC is a metadata profile describing the core set of access points for search and retrieval of natural history collections and observation databases. Using the HERBIS Learning System (HLS), we extract 74 independent elements from these labels. The automated text extraction tools are provided as a web service, so that users can reference digital images of specimens and receive back an extended Darwin Core XML representation of the content of the label. This automated extraction task is made more difficult by the high variability of museum label formats, OCR errors, and the open-class nature of some elements. In this paper we introduce our overall system architecture and variability-robust solutions, including the application of Hidden Markov Model and Naïve Bayes machine learning models, data cleaning, the use of field element identifiers, and specialist learning models. The techniques developed here could be adapted to any metadata extraction situation with noisy text and weakly ordered elements.
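
To make the Hidden Markov Model idea concrete, the following is a minimal sketch of tagging label tokens with field names. It is not the HLS implementation: the field names (`genus`, `species`, `collector`, `year`), the two training labels, the `token_shape` feature, and the add-one smoothing are all assumptions chosen for illustration. Tokens are reduced to a coarse shape so that open-class words never seen in training can still be scored.

```python
import math
from collections import Counter, defaultdict

SHAPES = ("NUM", "Cap", "low")  # hypothetical coarse token classes

def token_shape(tok):
    """Map a token to a coarse shape so unseen words still get an emission class."""
    if tok.isdigit():
        return "NUM"
    if tok[:1].isupper():
        return "Cap"
    return "low"

def train_hmm(sequences):
    """Count start, transition, and emission events from (token, state) sequences."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in sequences:
        prev = None
        for tok, state in seq:
            if prev is None:
                start[state] += 1
            else:
                trans[prev][state] += 1
            emit[state][token_shape(tok)] += 1
            prev = state
    return start, trans, emit

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under the counts in `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(tokens, model, states):
    """Return the most likely state sequence for a non-empty token list."""
    start, trans, emit = model
    n_s, n_o = len(states), len(SHAPES)
    obs = [token_shape(t) for t in tokens]
    best = {s: log_prob(start, s, n_s) + log_prob(emit[s], obs[0], n_o)
            for s in states}
    back = []  # one dict of back-pointers per position after the first
    for o in obs[1:]:
        pointers, nxt = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + log_prob(trans[p], s, n_s))
            pointers[s] = prev
            nxt[s] = (best[prev] + log_prob(trans[prev], s, n_s)
                      + log_prob(emit[s], o, n_o))
        back.append(pointers)
        best = nxt
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy, made-up training labels; real labels are far noisier and more varied.
STATES = ["genus", "species", "collector", "year"]
TRAIN = [
    [("Quercus", "genus"), ("alba", "species"), ("Smith", "collector"), ("1923", "year")],
    [("Carex", "genus"), ("stricta", "species"), ("Lee", "collector"), ("1950", "year")],
]
model = train_hmm(TRAIN)
tags = viterbi(["Acer", "rubrum", "Jones", "1901"], model, STATES)
print(list(zip(["Acer", "rubrum", "Jones", "1901"], tags)))
```

Because the transition probabilities are smoothed rather than fixed, an element order that was rare or absent in training is penalized but not ruled out, which is one way a sequence model can tolerate weakly ordered elements.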