Creating probabilistic databases from information extraction models

  • Authors:
  • Rahul Gupta;Sunita Sarawagi

  • Affiliations:
  • IBM Research Lab, New Delhi, India and IIT Bombay;Indian Institute of Technology, Bombay, India

  • Venue:
  • VLDB '06 Proceedings of the 32nd international conference on Very large data bases
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

Many real-life applications depend on databases automatically curated from unstructured sources through imperfect structure extraction tools. Such databases are best treated as imprecise representations of multiple extraction possibli-ties. State-of-the-art statistical models of extraction provide a sound probability distribution over extractions but are not easy to represent and query in a relational framework. In this paper we address the challenge of approximating such distributions as imprecise data models. In particular, we investigate a model that captures both row-level and column-level uncertainty and show that this representation provides significantly better approximation compared to models that use only row or only column level uncertainty. We present efficient algorithms for finding the best approximating parameters for such a model: our algorithm exploits the structure of the model to avoid enumerating the exponential number of extraction possibilities.