Integrating conflicting data: the role of source dependence

Authors:
Xin Luna Dong;Laure Berti-Equille;Divesh Srivastava
Affiliations:
AT&T Labs--Research, Florham Park, NJ;Université de Rennes, Rennes cedex, France;AT&T Labs--Research, Florham Park, NJ
Venue:
Proceedings of the VLDB Endowment
Year:
2009

Abstract

Many data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, require integrating data from multiple sources. Each of these sources provides a set of values and different sources can often provide conflicting values. To present quality data to users, it is critical that data integration systems can resolve conflicts and discover true values. Typically, we expect a true value to be provided by more sources than any particular false one, so we can take the value provided by the majority of the sources as the truth. Unfortunately, a false value can be spread through copying and that makes truth discovery extremely tricky. In this paper, we consider how to find true values from conflicting information when there are a large number of sources, among which some may copy from others. We present a novel approach that considers dependence between data sources in truth discovery. Intuitively, if two data sources provide a large number of common values and many of these values are rarely provided by other sources (e.g., particular false values), it is very likely that one copies from the other. We apply Bayesian analysis to decide dependence between sources and design an algorithm that iteratively detects dependence and discovers truth from conflicting information. We also extend our model by considering accuracy of data sources and similarity between values. Our experiments on synthetic data as well as real-world data show that our algorithm can significantly improve accuracy of truth discovery and is scalable when there are a large number of data sources.
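
The effect of copying on majority voting can be illustrated with a small sketch. This is not the authors' algorithm: the abstract describes a Bayesian analysis that estimates dependence iteratively, whereas the sketch below assumes the pairwise dependence probabilities are already known and set by hand. The source names, objects, values, the 0.9 probabilities, and the function names `naive_vote` and `discounted_vote` are all hypothetical. It shows only why a plain majority vote can be misled when a false value is copied, and how discounting the votes of likely copiers can change the outcome.

```python
from collections import defaultdict

# Hypothetical toy data: source -> {object: claimed value}.
# Assumed ground truth: o1=A, o2=B, o3=C. S4 and S5 are assumed to copy S3.
claims = {
    "S1": {"o1": "A", "o2": "B", "o3": "C"},
    "S2": {"o1": "A", "o2": "B", "o3": "C"},
    "S3": {"o1": "A", "o2": "X", "o3": "Y"},
    "S4": {"o1": "A", "o2": "X", "o3": "Y"},
    "S5": {"o1": "A", "o2": "X", "o3": "Y"},
}

# Assumed pairwise dependence probabilities, set by hand for illustration.
# Estimating them from the data (the Bayesian step) is omitted in this sketch.
dependence = {
    frozenset(("S3", "S4")): 0.9,
    frozenset(("S3", "S5")): 0.9,
    frozenset(("S4", "S5")): 0.9,
}


def naive_vote(claims):
    """Pick, for each object, the value claimed by the most sources."""
    counts = defaultdict(lambda: defaultdict(float))
    for values in claims.values():
        for obj, value in values.items():
            counts[obj][value] += 1.0
    return {obj: max(c, key=c.get) for obj, c in counts.items()}


def discounted_vote(claims, dependence):
    """Vote, but shrink a source's vote for a value by the probability that
    it depends on a source already counted for that same value."""
    counts = defaultdict(lambda: defaultdict(float))
    seen = defaultdict(list)  # (object, value) -> sources already counted
    for source in sorted(claims):
        for obj, value in claims[source].items():
            weight = 1.0
            for earlier in seen[(obj, value)]:
                pair = frozenset((source, earlier))
                weight *= 1.0 - dependence.get(pair, 0.0)
            counts[obj][value] += weight
            seen[(obj, value)].append(source)
    return {obj: max(c, key=c.get) for obj, c in counts.items()}


print(naive_vote(claims))
print(discounted_vote(claims, dependence))
```

With this toy input, the naive vote selects the copied values X and Y for o2 and o3, because three of the five sources provide them. The discounted vote counts S4 and S5 at only a fraction of a vote for those values, so the values from the two independent sources win. The result depends entirely on the hand-set probabilities, and how to obtain such probabilities from the data is the problem the paper addresses.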