Cost effective ontology population with data from lists in OCRed historical documents

Authors:
Thomas L. Packer; David W. Embley
Affiliations:
Brigham Young University, Provo, Utah (both authors)
Venue:
Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing
Year:
2013

Abstract

A method of automatically extracting facts from lists in OCRed documents and inserting them into an ontology would contribute to making a variety of historical knowledge machine searchable, queryable, and linkable. To work well, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its selection of human guidance. We propose ListReader, a wrapper-induction solution for information extraction that is specialized for lists in OCRed documents. ListReader can induce either a regular-expression grammar or a Hidden Markov Model. Each can infer list structure and field labels from OCR text. We decrease the cost and improve the accuracy of the induction process using semi-supervised machine learning and active learning, allowing induction of a wrapper from almost a single hand-labeled instance per field per list. After applying an induced wrapper, ListReader automatically maps the labeled text it produces to a rich variety of ontologically structured predicates. We evaluate our implementation on family history books in terms of the typical F-measure and a new metric, "Label Efficiency", which measures both extraction quality and cost in a single number. We show with statistical significance that ListReader reaches values closer to optimal levels than a state-of-the-art statistical sequence labeler.
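To make the idea of a regular-expression wrapper induced from one hand-labeled record more concrete, here is a minimal Python sketch. It is not ListReader's algorithm: the generalization rule (digits become `\d+`, other fields become a run of non-delimiter characters), the function names `induce_wrapper`, `extract`, and `f_measure`, and the sample list records are all assumptions made up for illustration. The sketch also computes the standard F-measure (harmonic mean of precision and recall) over extracted fields. The abstract does not define "Label Efficiency", so it is not attempted here.

```python
import re


def induce_wrapper(text, labeled_fields):
    """Generalize one hand-labeled record into a regex with named groups.

    labeled_fields is a list of (label, substring) pairs in left-to-right
    order. Delimiter text between fields is kept verbatim; field text is
    generalized with a crude, hypothetical character-class rule.
    """
    parts = []
    pos = 0
    for label, value in labeled_fields:
        start = text.index(value, pos)
        parts.append(re.escape(text[pos:start]))  # literal delimiter before the field
        field_re = r"\d+" if value.isdigit() else r"[^\d,.;]+?"
        parts.append(f"(?P<{label}>{field_re})")
        pos = start + len(value)
    parts.append(re.escape(text[pos:]))  # trailing literal text
    return re.compile("^" + "".join(parts) + "$")


def extract(wrapper, lines):
    """Apply the wrapper to each line; return a set of (line, label, value)."""
    found = set()
    for i, line in enumerate(lines):
        m = wrapper.match(line)
        if m:
            found.update((i, label, value) for label, value in m.groupdict().items())
    return found


def f_measure(predicted, gold):
    """Balanced F-measure over sets of extracted fields."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# One hand-labeled seed record (invented example text).
seed = "Mary Ann, b. 1823; d. 1890."
wrapper = induce_wrapper(
    seed, [("name", "Mary Ann"), ("birth", "1823"), ("death", "1890")]
)

# Unlabeled records from the same list; the last one carries an OCR error.
lines = [
    "John, b. 1825; d. 1901.",
    "Eliza Jane, b. 1830; d. 1832.",
    "Wm. H., b. 18Z7; d. 1899.",
]
gold = {
    (0, "name", "John"), (0, "birth", "1825"), (0, "death", "1901"),
    (1, "name", "Eliza Jane"), (1, "birth", "1830"), (1, "death", "1832"),
    (2, "name", "Wm. H."), (2, "birth", "1827"), (2, "death", "1899"),
}

predicted = extract(wrapper, lines)
score = f_measure(predicted, gold)
```

In this toy run the rigid wrapper matches the first two records and misses the third, whose OCR-damaged year and abbreviated name fall outside the induced pattern. That gap is the kind of brittleness that the abstract's requirements of format adaptability and OCR tolerance are meant to address.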