Multi-document statistical fact extraction and fusion

Authors:
David Yarowsky;Gideon S. Mann
Affiliations:
The Johns Hopkins University;The Johns Hopkins University
Venue:
Multi-document statistical fact extraction and fusion
Year:
2006

Abstract

This dissertation presents original techniques for statistical fact extraction and fusion from multiple documents. Fact extraction, or relationship extraction, is the process of scanning natural language text to find instances of a predetermined class of facts (e.g. birthday(x,y)). Statistical fact extractors are trained from examples within a framework in which a set of examples and a target model are used to annotate an automatically collected corpus. This annotation then provides training data for classifiers (Phrase Conditional Likelihood and Naive Bayes) or sequence models (Conditional Random Fields).
Fact extractors are used in two information retrieval tasks. In question answering, the set of candidate answers is narrowed using fine-grained proper noun ontological facts (is-a(X, Y)) extracted from a corpus by rote classifiers, leading to higher performance. Extracted facts are also used for name-referent disambiguation, or cross-document coreference, where one personal name may refer to multiple people in the world. The distinguishing biographic facts for each person, such as birthday(x,y) and occupation(x,y), are automatically extracted from plain text and used along with other statistical methods to distinguish between mentions of each referent.
This dissertation also presents novel techniques for fusion, which integrate facts extracted from multiple sources. For the task of biographic fact extraction, fusion of factual information extracted from multiple documents improves the precision of the resulting information. Further improvements result from cascaded fact extraction, where certain facts are extracted and fused and these fused facts are then used to extract additional information. The technique of cascaded fact extraction and fusion is also applied to time-bounded facts, where a cascade of fact extractors produces a timeline of corporate management succession.
Collectively, this research demonstrates the utility of multi-document fact extraction and fusion. It shows that facts can serve as a building block for deeper text processing, such as finding coreferent names in a series of documents, finding the answers to questions, and constructing a timeline for time-variable facts. The key aspects of the process are training with minimal supervision, high-performance statistical fact extraction, fusion across multiple sources of information, and cascaded extraction.
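
To make the idea of multi-document fusion concrete, the following is a minimal sketch and not the dissertation's actual method. It assumes each document-level extractor emits a candidate fact with a confidence score, and it fuses candidates by summing confidence per value and keeping the highest-scoring value. The function name fuse_facts, the tuple layout, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def fuse_facts(extractions):
    """Fuse candidate facts gathered from several documents.

    `extractions` is an iterable of (entity, relation, value, confidence)
    tuples. This layout is an assumption made for the sketch. For each
    (entity, relation) pair, the confidences of identical values are summed
    and the value with the largest total is kept.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for entity, relation, value, confidence in extractions:
        scores[(entity, relation)][value] += confidence

    fused = {}
    for key, candidates in scores.items():
        # Pick the value whose accumulated confidence is highest.
        fused[key] = max(candidates, key=candidates.get)
    return fused


# Hypothetical extractor output: one tuple per document-level extraction.
extractions = [
    ("Person A", "birthyear", "1756", 0.6),
    ("Person A", "birthyear", "1756", 0.5),
    ("Person A", "birthyear", "1791", 0.7),  # a single confident but conflicting extraction
    ("Person A", "occupation", "composer", 0.8),
]

fused = fuse_facts(extractions)
for (entity, relation), value in fused.items():
    print(f"{relation}({entity}, {value})")
```

In this sketch, two moderately confident extractions that agree outweigh one more confident extraction that disagrees, which is one simple way that combining evidence across documents can raise precision. A cascaded setup could, under the same assumptions, pass the fused values back in as features for a second round of extraction.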