Extracting person names from diverse and noisy OCR text

Authors:
Thomas L. Packer;Joshua F. Lutes;Aaron P. Stewart;David W. Embley;Eric K. Ringger;Kevin D. Seppi;Lee S. Jensen
Affiliations:
Brigham Young University, Provo, UT, USA;Brigham Young University, Provo, UT, USA;Brigham Young University, Provo, UT, USA;Brigham Young University, Provo, UT, USA;Brigham Young University, Provo, UT, USA;Brigham Young University, Provo, UT, USA;Ancestry.com, Inc., Provo, UT, USA
Venue:
AND '10 Proceedings of the fourth workshop on Analytics for noisy unstructured text data
Year:
2010

Abstract

Named entity recognition applied to scanned and OCRed historical documents can contribute to the discoverability of historical information. However, entity recognition from some historical corpora is much more difficult than from natively digital text because of the marked presence of word errors and absence of page layout information. How difficult can it be and what level of quality can be expected? We apply three typical extraction algorithms to the task of extracting person names from multiple types of noisy OCR documents found in the collection of a major genealogy content provider and compare their performance using a number of quality metrics. We also show an improvement in extraction quality using a majority-vote ensemble of the three extractors. We evaluate the extraction quality with respect to two references: what a human can manually extract from OCR output and from the original document images. We illustrate the challenges and opportunities at hand for extracting names from OCRed data and identify directions for further improvement.
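The majority-vote ensemble mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say at what granularity the votes are taken, so this example assumes token-level voting over labels such as "PER" and "O"; the function name `majority_vote`, the label set, and the fallback behavior are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], fallback: str = "O") -> List[str]:
    """Combine per-token labels from several extractors by strict majority.

    Assumption: every extractor labels the same token sequence, one label
    per token. Tokens without a strict majority receive the fallback label.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all extractors must label the same tokens")

    combined = []
    for labels in zip(*predictions):
        # Most frequent label at this token position and its vote count.
        label, count = Counter(labels).most_common(1)[0]
        combined.append(label if count > len(labels) / 2 else fallback)
    return combined


# Hypothetical outputs of three extractors on the noisy OCR tokens
# ["Mr.", "John", "Srnith", "of", "Provo"].
extractor_a = ["O", "PER", "PER", "O", "O"]
extractor_b = ["O", "PER", "O", "O", "PER"]
extractor_c = ["O", "PER", "PER", "O", "O"]

ensemble = majority_vote([extractor_a, extractor_b, extractor_c])
```

With three voters and two labels a strict majority always exists, so the fallback only matters if the extractors emit more than two label types or if an even number of extractors is combined.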