This dissertation presents original techniques for statistical fact extraction and fusion from multiple documents. Fact extraction, or relationship extraction, is a process in which natural language text is scanned to find instances of a predetermined class of facts (e.g. birthday(x,y)). A framework for training statistical fact extractors from examples is used, wherein a set of examples and a target model are used to annotate an automatically collected corpus. This annotation then provides training data for classifiers (Phrase Conditional Likelihood and Naive Bayes) or sequence models (Conditional Random Fields). Fact extractors are used in two information retrieval tasks. In question answering, the set of candidate answers is narrowed using fine-grained proper noun ontological facts (is-a(X, Y)) extracted from a corpus by rote classifiers, leading to higher performance. Extracted facts are also used for name-referent disambiguation, or cross-document coreference, where one personal name may refer to multiple potential people in the world. The distinguishing biographic facts for each person, such as birthday(x,y) and occupation(x,y), are automatically extracted from plain text, and these biographic facts are used along with other statistical methods to distinguish between mentions of each of the referents. This dissertation also presents novel techniques for fusion, which integrate facts extracted from multiple sources. For the task of biographic fact extraction, fusion of factual information extracted from multiple documents improves the precision of the resulting information. Further improvements result from cascaded fact extraction, where certain facts are extracted and fused and then these facts are used to extract additional information. The technique of cascaded fact extraction and fusion is also applied to time-bounded facts, where a cascade of fact extractors produces a timeline of corporate management succession.
Collectively, this research demonstrates the utility of multi-document fact extraction and fusion. It shows that facts can serve as a building block for deeper text processing such as finding coreferent names in a series of documents, finding the answers to questions, and constructing a timeline for time-variable facts. The key aspects of the process are training with minimal supervision, high-performance statistical fact extraction, fusion across multiple sources of information, and cascaded extraction.
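To make the idea of fusion across multiple sources concrete, the following is a minimal sketch and not the dissertation's method: the abstract does not specify the fusion technique, so simple majority voting across documents is assumed here as a stand-in. The function name `fuse_facts`, the tuple layout, and the toy input are all hypothetical.

```python
from collections import Counter, defaultdict


def fuse_facts(extractions):
    """Fuse candidate facts by majority vote across documents.

    ASSUMPTION: majority voting is an illustrative stand-in; the
    fusion technique used in the dissertation is not given here.

    `extractions` is an iterable of (entity, relation, value, doc_id)
    tuples, a hypothetical layout chosen for this sketch.
    Returns {(entity, relation): (winning_value, vote_share)}.
    """
    votes = defaultdict(Counter)
    seen = set()
    for entity, relation, value, doc_id in extractions:
        normalized = value.strip().lower()
        key = (entity, relation, normalized, doc_id)
        if key in seen:
            # Count each document at most once per candidate value.
            continue
        seen.add(key)
        votes[(entity, relation)][normalized] += 1

    fused = {}
    for slot, counter in votes.items():
        value, count = counter.most_common(1)[0]
        fused[slot] = (value, count / sum(counter.values()))
    return fused


# Made-up toy input: three documents, one of which disagrees.
toy = [
    ("Jane Doe", "birthday", "1970-01-01", "doc1"),
    ("Jane Doe", "birthday", "1970-01-01", "doc2"),
    ("Jane Doe", "birthday", "1971-02-03", "doc3"),
    ("Jane Doe", "birthday", "1970-01-01", "doc2"),  # repeat within doc2
]
print(fuse_facts(toy))
```

In this sketch a single noisy extraction is outvoted by agreeing documents, which is one simple way that combining sources can raise precision. A cascaded variant could feed the fused values back in as evidence for a second extraction pass.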