Enabling search for facts and implied facts in historical documents

Authors:
David W. Embley;Spencer Machado;Thomas Packer;Joseph Park;Andrew Zitzelberger;Stephen W. Liddle;Nathan Tate;Deryle W. Lonsdale
Affiliations:
Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah
Venue:
Proceedings of the 2011 Workshop on Historical Document Imaging and Processing
Year:
2011

Abstract

Building a database of facts extracted from historical documents to enable database-like query and search would reduce the tedium of gleaning facts of interest from historical documents. We propose a solution in which historical documents themselves constitute the stored database. In our solution, we use information-extraction techniques to produce a conceptualized external annotation of facts found in each document, and we superimpose the conceptualization over the document collection. The annotation process populates the conceptualization, producing a repository of extracted facts, and a reasoner obtains inferred facts from these extracted facts. Our query interface accepts free-form queries and converts them to formal queries over the extracted and inferred facts. Displayed results include, in addition to standard query results, images of original documents with results highlighted, along with reasoning chains for inferred facts grounded in these highlighted facts. Along with giving the implementation status of our proof-of-concept prototype, we present results for extraction accuracy and efficiency and point to current and future work needed to enable a practical solution for the envisioned historical-document database.
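
As a rough illustration of how inferred facts can stay grounded in extracted ones, the following minimal Python sketch derives grandparent facts from extracted parent facts and prints a reasoning chain back to the source pages. It is not the prototype's implementation: the `Fact` class, the `parent_of` and `grandparent_of` predicates, the single inference rule, the person names, and the page identifiers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Fact:
    """A fact that is either extracted from a page or inferred from other facts."""
    predicate: str
    subject: str
    obj: str
    page: str = ""                    # hypothetical page id; empty for inferred facts
    support: Tuple["Fact", ...] = ()  # facts this one was inferred from


def infer_grandparents(extracted):
    """Apply one illustrative rule: parent_of(a, b) and parent_of(b, c) imply grandparent_of(a, c)."""
    parents = [f for f in extracted if f.predicate == "parent_of"]
    inferred = []
    for a in parents:
        for b in parents:
            if a.obj == b.subject:
                inferred.append(
                    Fact("grandparent_of", a.subject, b.obj, support=(a, b))
                )
    return inferred


def explain(fact):
    """Return the reasoning chain for a fact as indented lines of text."""
    if not fact.support:
        return [f"{fact.subject} {fact.predicate} {fact.obj} (extracted from {fact.page})"]
    lines = [f"{fact.subject} {fact.predicate} {fact.obj} (inferred)"]
    for supporting in fact.support:
        lines.extend("  " + line for line in explain(supporting))
    return lines


# Invented example data: two extracted facts located on two different pages.
extracted = [
    Fact("parent_of", "Ann", "Ben", page="page-001"),
    Fact("parent_of", "Ben", "Cora", page="page-002"),
]
inferred = infer_grandparents(extracted)

for fact in inferred:
    print("\n".join(explain(fact)))
```

In this sketch each inferred fact carries references to the facts that support it, so a result display could follow those references down to the page identifiers and highlight the corresponding regions of the original document images.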