This paper describes the information properties of museum specimen labels and the machine learning tools used to automatically extract Darwin Core (DwC) and other metadata from these labels after they have been processed through Optical Character Recognition (OCR). The DwC is a metadata profile describing the core set of access points for search and retrieval of natural history collections and observation databases. Using the HERBIS Learning System (HLS), we extract 74 independent elements from these labels. The automated text extraction tools are provided as a web service, so that users can reference digital images of specimens and receive in return an extended Darwin Core XML representation of the label content. This automated extraction task is made more difficult by the high variability of museum label formats, by OCR errors, and by the open-class nature of some elements. In this paper we introduce our overall system architecture and a set of variability-robust solutions, including the application of Hidden Markov and Naïve Bayes machine learning models, data cleaning, the use of field element identifiers, and specialist learning models. The techniques developed here could be adapted to any metadata extraction situation involving noisy text and weakly ordered elements.
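To make the Naïve Bayes component concrete, the following is a minimal sketch of how a token from an OCRed label might be assigned to a metadata element. It is not the HLS implementation, which the abstract does not detail. The element names (`genus`, `species`, `year`), the `features` function, the `NaiveBayes` class, and the training tokens are all hypothetical and chosen only for illustration. The sketch assumes that a coarse "shape" feature (runs of upper-case letters, lower-case letters, and digits) helps a classifier tolerate single-character OCR errors.

```python
import math
from collections import Counter, defaultdict


def features(token):
    """Return coarse features for a token (hypothetical feature set).

    The shape feature maps digits to '9', upper-case letters to 'A',
    lower-case letters to 'a', and collapses repeated symbols, so that
    'Quercus' and the OCR-damaged 'Quercns' share the shape 'Aa'.
    """
    shape = []
    for c in token:
        sym = "9" if c.isdigit() else "A" if c.isupper() else "a" if c.isalpha() else c
        if not shape or shape[-1] != sym:
            shape.append(sym)
    return ["shape=" + "".join(shape), "word=" + token.lower()]


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        for token, label in examples:
            self.label_counts[label] += 1
            for f in features(token):
                self.feat_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, token):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each feature.
            score = math.log(n / total)
            denom = sum(self.feat_counts[label].values()) + self.alpha * len(self.vocab)
            for f in features(token):
                score += math.log((self.feat_counts[label][f] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training tokens, not taken from the paper's data.
TRAIN = [
    ("Quercus", "genus"), ("Acer", "genus"), ("Pinus", "genus"),
    ("alba", "species"), ("rubrum", "species"), ("strobus", "species"),
    ("1923", "year"), ("1897", "year"), ("1931", "year"),
]

model = NaiveBayes().fit(TRAIN)
for tok in ["Quercns", "nigra", "1905"]:
    print(tok, "->", model.predict(tok))
```

In this toy setting the OCR-damaged token `Quercns` is still labeled `genus` because its shape matches the training examples even though the word itself was never seen. A per-token classifier like this ignores element order; a Hidden Markov model would additionally exploit the weak ordering of elements on a label.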