Named entity recognition applied to scanned and OCRed historical documents can contribute to the discoverability of historical information. However, entity recognition from some historical corpora is much more difficult than from natively digital text because of the marked presence of word errors and absence of page layout information. How difficult can it be and what level of quality can be expected? We apply three typical extraction algorithms to the task of extracting person names from multiple types of noisy OCR documents found in the collection of a major genealogy content provider and compare their performance using a number of quality metrics. We also show an improvement in extraction quality using a majority-vote ensemble of the three extractors. We evaluate the extraction quality with respect to two references: what a human can manually extract from OCR output and from the original document images. We illustrate the challenges and opportunities at hand for extracting names from OCRed data and identify directions for further improvement.
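The majority-vote ensemble mentioned above can be illustrated with a minimal sketch. The abstract does not say at what granularity the vote is taken, so this example assumes token-level voting: each extractor assigns every OCR token a label, and a token keeps a label only if a strict majority of extractors agree on it. The function name `majority_vote`, the labels `"NAME"` and `"O"`, and the sample tokens are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions, default_label="O"):
    """Combine per-token labels from several extractors by majority vote.

    predictions: one label sequence per extractor, all over the same tokens.
    A token receives a label only if more than half of the extractors
    agree on it; otherwise it falls back to default_label.
    (Token-level voting is an assumption, not the paper's stated method.)
    """
    if not predictions:
        return []
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all extractors must label the same tokens")

    min_votes = len(predictions) // 2 + 1  # strict majority
    combined = []
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        combined.append(label if count >= min_votes else default_label)
    return combined


# Hypothetical OCR tokens with typical character errors.
tokens = ["Mr", "Jonn", "Sm1th", "of", "Provo"]
extractor_a = ["O", "NAME", "NAME", "O", "O"]
extractor_b = ["O", "NAME", "O", "O", "NAME"]
extractor_c = ["NAME", "NAME", "NAME", "O", "O"]

combined = majority_vote([extractor_a, extractor_b, extractor_c])
names = [tok for tok, lab in zip(tokens, combined) if lab == "NAME"]
print(names)
```

With three extractors, a token is kept as part of a name when at least two of them agree, so a false positive produced by a single extractor (here "Mr" and "Provo") is voted out.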