A method for automatically extracting facts from lists in OCRed documents and inserting them into an ontology would help make a variety of historical knowledge machine searchable, queryable, and linkable. To work well, such a process must adapt to variations in list format, tolerate OCR errors, and select human guidance carefully. We propose ListReader, a wrapper-induction solution for information extraction that is specialized for lists in OCRed documents. ListReader can induce either a regular-expression grammar or a Hidden Markov Model, each of which can infer list structure and field labels from OCR text. We decrease the cost and improve the accuracy of the induction process using semi-supervised machine learning and active learning, allowing induction of a wrapper from approximately a single hand-labeled instance per field per list. After applying an induced wrapper, ListReader automatically maps the labeled text it produces to a rich variety of ontologically structured predicates. We evaluate our implementation on family history books using the standard F-measure and a new metric, "Label Efficiency", which captures both extraction quality and cost in a single number. We show, with statistical significance, that ListReader comes closer to optimal values on these measures than a state-of-the-art statistical sequence labeler.
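To make the idea of a regular-expression wrapper concrete, the following minimal sketch applies a hand-written pattern to a few list lines of the kind that might appear in a family history book. It is an illustration only, not ListReader's induced grammar: the pattern `RECORD`, the field names (`child_number`, `given_name`, `birth_year`, `death_year`), the function `extract`, and the sample lines are all hypothetical, and the induction, active-learning, and ontology-mapping steps are not shown.

```python
import re

# Hypothetical hand-written pattern standing in for an induced wrapper.
# Each named group plays the role of a field label.
RECORD = re.compile(
    r"^(?P<child_number>\d+)\.\s+"          # list ordinal, e.g. "2."
    r"(?P<given_name>[A-Z][a-z]+)"          # capitalized given name
    r"(?:,\s*b\.\s*(?P<birth_year>\d{4}))?"  # optional ", b. 1774"
    r"(?:,\s*d\.\s*(?P<death_year>\d{4}))?"  # optional ", d. 1852"
    r"\.?$"                                  # optional closing period
)


def extract(lines):
    """Return one dict of labeled fields per line that matches the pattern."""
    records = []
    for line in lines:
        match = RECORD.match(line.strip())
        if match:
            # Keep only the fields that were actually present in the line.
            fields = {k: v for k, v in match.groupdict().items() if v is not None}
            records.append(fields)
    return records


# Made-up sample lines; the last one is non-list text that should be skipped.
sample = [
    "1. Andrew, b. 1772.",
    "2. Mary, b. 1774, d. 1852.",
    "3. William.",
    "Page header noise",
]
records = extract(sample)
```

A fixed pattern like this one breaks on OCR errors and on format variation across lists, which is the brittleness that inducing the wrapper per list from a few labeled instances is meant to address.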