Creating probabilistic databases from information extraction models

Authors:
Rahul Gupta;Sunita Sarawagi
Affiliations:
IBM Research Lab, New Delhi, India and IIT Bombay;Indian Institute of Technology, Bombay, India
Venue:
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Year:
2006

Abstract

Many real-life applications depend on databases that are automatically curated from unstructured sources by imperfect structure extraction tools. Such databases are best treated as imprecise representations of multiple extraction possibilities. State-of-the-art statistical extraction models provide a sound probability distribution over extractions, but that distribution is not easy to represent and query in a relational framework. In this paper we address the challenge of approximating such distributions as imprecise data models. In particular, we investigate a model that captures both row-level and column-level uncertainty, and show that this representation approximates the distribution significantly better than models that use only row-level or only column-level uncertainty. We present efficient algorithms for finding the best approximating parameters for such a model; our algorithm exploits the structure of the model to avoid enumerating the exponential number of extraction possibilities.
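
To make the contrast between row-level and column-level uncertainty concrete, the following is a minimal sketch on a toy example, not the paper's algorithm. It assumes a small, fully enumerated distribution over extractions (exactly the enumeration the paper's algorithms avoid) and uses KL divergence as the measure of approximation quality, which the abstract does not name. The address values, probabilities, and function names are all hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

# Hypothetical distribution over extractions of one address string into
# (house_no, area); the values and probabilities are invented for illustration.
EXTRACTIONS = {
    ("52", "Goregaon West"): 0.5,
    ("52-A", "Goregaon West"): 0.3,
    ("52", "Goregaon"): 0.1,
    ("52-A", "Goregaon"): 0.1,
}


def column_model(dist):
    """Column-level uncertainty only: independent per-column marginals."""
    n_cols = len(next(iter(dist)))
    marginals = [defaultdict(float) for _ in range(n_cols)]
    for row, p in dist.items():
        for col, value in enumerate(row):
            marginals[col][value] += p
    # The joint implied by this model is the product of the column marginals.
    return {
        combo: math.prod(marginals[c][v] for c, v in enumerate(combo))
        for combo in product(*(m.keys() for m in marginals))
    }


def row_model(dist, k):
    """Row-level uncertainty only: keep the k most probable rows, renormalized."""
    top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {row: p / total for row, p in top}


def kl_divergence(p_dist, q_dist):
    """KL(P || Q) in nats; infinite if Q gives zero mass to a row P supports."""
    kl = 0.0
    for row, p in p_dist.items():
        if p == 0.0:
            continue
        q = q_dist.get(row, 0.0)
        if q == 0.0:
            return math.inf
        kl += p * math.log(p / q)
    return kl


if __name__ == "__main__":
    print("column-only KL:", kl_divergence(EXTRACTIONS, column_model(EXTRACTIONS)))
    print("row-only (k=2) KL:", kl_divergence(EXTRACTIONS, row_model(EXTRACTIONS, 2)))
```

In this toy setting the column-only model covers every extraction but blurs the dependence between columns, while the row-only model with two rows assigns zero probability to the extractions it drops. A model that combines both kinds of uncertainty, which is what the abstract investigates, is not implemented here.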