Redundancy in web-scaled information extraction: probabilistic model and experimental results

  • Authors: Oren Etzioni; Douglas C. Downey
  • Affiliations: University of Washington; University of Washington
  • Venue: Redundancy in web-scaled information extraction: probabilistic model and experimental results
  • Year: 2008

Abstract

Information Extraction (IE) is the task of automatically extracting knowledge from text. The massive body of text now available on the World Wide Web presents an unprecedented opportunity for IE. IE systems promise to encode vast quantities of Web content into machine-processable knowledge bases, presenting a new approach to a fundamental challenge for artificial intelligence: the automatic acquisition of massive bodies of knowledge. Such knowledge bases would dramatically extend the capabilities of Web applications. Future Web search engines, for example, could query the knowledge bases to answer complicated questions that require synthesizing information across multiple Web pages. However, IE on the Web is challenging due to the enormous variety of distinct concepts expressed. All extraction techniques make errors, and the standard error-detection strategy used in previous, small-corpus extraction systems (hand-labeling examples of each concept to be extracted, then training a classifier on the labeled examples) is intractable on the Web. How can we automatically identify correct extractions for arbitrary target concepts, without hand-labeled examples? This thesis shows how IE on the Web is made possible through the KnowItAll hypothesis, which states that extractions that occur more frequently in distinct sentences in a corpus are more likely to be correct. The KnowItAll hypothesis holds on the Web, and can be used to identify many correct extractions because the Web is highly redundant: individual facts are often repeated many times, and in many different ways. In this thesis, we show that a probabilistic model of the KnowItAll hypothesis, coupled with the redundancy of the Web, can power effective IE for arbitrary target concepts without hand-labeled data. In experiments with IE on the Web, we show that the probabilities produced by our model are 15 times better, on average, when compared with techniques from previous work. We also prove formally that under the assumptions of the model, "Probably Approximately Correct" IE can be attained from only unlabeled data.
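
The intuition behind the KnowItAll hypothesis can be illustrated with a minimal sketch that ranks candidate extractions by the number of distinct sentences supporting them. This is not the probabilistic model developed in the thesis, which the abstract does not specify; it is only a plain frequency ranking. The function name `rank_by_redundancy`, the (extraction, sentence) input format, and the whitespace/case normalization are all assumptions made for illustration.

```python
from collections import defaultdict


def rank_by_redundancy(observations):
    """Rank extractions by how many distinct sentences they appear in.

    `observations` is an iterable of (extraction, sentence) pairs.
    This input format is a hypothetical one chosen for the sketch.
    Returns a list of (extraction, distinct_sentence_count) tuples,
    highest count first, ties broken alphabetically.
    """
    sentences_by_extraction = defaultdict(set)
    for extraction, sentence in observations:
        # Assumption: sentences that differ only in case or spacing
        # are treated as the same sentence and counted once.
        normalized = " ".join(sentence.lower().split())
        sentences_by_extraction[extraction].add(normalized)

    counts = {e: len(s) for e, s in sentences_by_extraction.items()}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Under the hypothesis, extractions nearer the top of the returned list would be treated as more likely to be correct. Turning such counts into calibrated probabilities is the job of the thesis's model and is beyond this sketch.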