Data integration with uncertainty

  • Authors:
  • Xin Luna Dong; Alon Halevy; Cong Yu

  • Affiliations:
  • AT&T Labs-Research, Florham Park, NJ 07932, USA; Google Inc., Mountain View, CA 94043, USA; Yahoo! Research, New York, NY 10018, USA

  • Venue:
  • The VLDB Journal — The International Journal on Very Large Data Bases
  • Year:
  • 2009

Abstract

This paper reports our first set of results on managing uncertainty in data integration. We posit that data-integration systems need to handle uncertainty at three levels and do so in a principled fashion. First, the semantic mappings between the data sources and the mediated schema may be approximate because there may be too many of them to be created and maintained or because in some domains (e.g., bioinformatics) it is not clear what the mappings should be. Second, the data from the sources may be extracted using information extraction techniques and so may yield erroneous data. Third, queries to the system may be posed with keywords rather than in a structured form. As a first step to building such a system, we introduce the concept of probabilistic schema mappings and analyze their formal foundations. We show that there are two possible semantics for such mappings: by-table semantics assumes that there exists a correct mapping but we do not know what it is; by-tuple semantics assumes that the correct mapping may depend on the particular tuple in the source data. We present the query complexity and algorithms for answering queries in the presence of probabilistic schema mappings, and we describe an algorithm for efficiently computing the top-k answers to queries in such a setting. Finally, we consider using probabilistic mappings in the scenario of data exchange.
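To make the notion of a probabilistic schema mapping and its by-table semantics concrete, the sketch below shows a toy illustration, not the paper's implementation: the source table, mediated schema, attribute names, and probabilities are all hypothetical. A probabilistic mapping is modeled as a set of candidate attribute correspondences, each weighted by a probability; under by-table semantics a single (unknown) mapping applies to the whole table, so an answer's probability is the sum of the probabilities of the mappings that produce it, and ranking answers by this probability yields the top-k results.

```python
from collections import defaultdict

# Hypothetical example: a source table S(pname, email-addr) and a mediated
# schema T(name, email). A probabilistic schema mapping is a set of candidate
# attribute correspondences whose probabilities sum to 1.
prob_mapping = [
    ({"pname": "name", "email-addr": "email"}, 0.6),   # m1
    ({"pname": "email", "email-addr": "name"}, 0.3),   # m2
    ({"pname": "name"}, 0.1),                          # m3: email-addr unmapped
]

source_tuples = [
    {"pname": "Alice", "email-addr": "alice@example.org"},
    {"pname": "bob@example.org", "email-addr": "Bob"},
]

def by_table_answers(query_attr, query_value):
    """By-table semantics: one mapping is correct for the entire table, but we
    do not know which. An answer's probability is the total probability of the
    mappings under which it is produced; results are returned ranked, so taking
    a prefix gives the top-k answers."""
    probs = defaultdict(float)
    for mapping, p in prob_mapping:
        for t in source_tuples:
            # Project the source tuple onto the mediated schema via this mapping.
            projected = {mapping[a]: v for a, v in t.items() if a in mapping}
            if projected.get(query_attr) == query_value:
                probs[tuple(sorted(projected.items()))] += p
    return sorted(probs.items(), key=lambda kv: -kv[1])

# Query "name = Alice": produced under m1 (0.6) and m3 (0.1), but not m2.
print(by_table_answers("name", "Alice"))
```

Under by-tuple semantics, by contrast, each source tuple may independently follow a different mapping, so answer probabilities are computed over sequences of mapping choices rather than a single table-wide choice; that case is omitted from the sketch above.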