A Bayesian approach to discovering truth from conflicting sources for data integration

Authors:
Bo Zhao; Benjamin I. P. Rubinstein; Jim Gemmell; Jiawei Han
Affiliations:
University of Illinois, Urbana, IL; Microsoft Research, Mountain View, CA; Microsoft Research, Mountain View, CA; University of Illinois, Urbana, IL
Venue:
Proceedings of the VLDB Endowment
Year:
2012

Abstract

In practical data integration systems, it is common for the data sources being integrated to provide conflicting information about the same entity. Consequently, a major challenge for data integration is to derive the most complete and accurate integrated records from diverse and sometimes conflicting sources. We term this challenge the truth finding problem. We observe that some sources are generally more reliable than others, and therefore a good model of source quality is the key to solving the truth finding problem. In this work, we propose a probabilistic graphical model that can automatically infer true records and source quality without any supervision. In contrast to previous methods, our principled approach leverages a generative process of two types of errors (false positive and false negative) by modeling two different aspects of source quality. In so doing, ours is also the first approach designed to merge multi-valued attribute types. Our method is scalable, due to an efficient sampling-based inference algorithm that needs very few iterations in practice and enjoys linear time complexity, with an even faster incremental variant. Experiments on two real-world datasets show that our new method outperforms existing state-of-the-art approaches to the truth finding problem.
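
To make the idea of two-sided source quality concrete, the following is a minimal illustrative sketch, not the authors' model or inference algorithm. It assumes each source's sensitivity (the chance it asserts a fact that is true) and false positive rate (the chance it asserts a fact that is false) are already known, and applies Bayes' rule to a single candidate fact. The paper instead infers these qualities and the truth jointly, without supervision, by sampling. The function name, parameter names, source labels, and all numeric values are hypothetical.

```python
from math import prod


def posterior_true(claims, sensitivity, false_positive_rate, prior=0.5):
    """Posterior probability that one candidate fact is true.

    claims: dict mapping source name -> True if the source asserts the fact,
            False if the source covers the entity but omits the fact.
    sensitivity, false_positive_rate: dicts of assumed-known source qualities.
    """
    # Likelihood of the observed claims if the fact is true:
    # asserting is a hit, omitting is a false negative.
    like_true = prod(
        sensitivity[s] if asserted else 1.0 - sensitivity[s]
        for s, asserted in claims.items()
    )
    # Likelihood if the fact is false:
    # asserting is a false positive, omitting is a correct rejection.
    like_false = prod(
        false_positive_rate[s] if asserted else 1.0 - false_positive_rate[s]
        for s, asserted in claims.items()
    )
    numerator = prior * like_true
    return numerator / (numerator + (1.0 - prior) * like_false)


# Hypothetical qualities: A is careful, B is incomplete, C is uninformative.
sensitivity = {"A": 0.9, "B": 0.6, "C": 0.5}
false_positive_rate = {"A": 0.05, "B": 0.3, "C": 0.5}

# A and C assert the fact, B omits it.
p_supported = posterior_true(
    {"A": True, "B": False, "C": True}, sensitivity, false_positive_rate
)
# Only C asserts the fact.
p_unsupported = posterior_true(
    {"A": False, "B": False, "C": True}, sensitivity, false_positive_rate
)
print(round(p_supported, 3), round(p_unsupported, 3))
```

In this sketch an omission by source B counts only mildly against the fact, because B is assumed to miss true facts often, while an assertion by source C carries no weight either way, because C is assumed to assert true and false facts equally often. Separating the two error types is what lets omissions and assertions be weighed differently, which matters for multi-valued attributes where several values can be true at once.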