Semi-supervised truth discovery

Authors: Xiaoxin Yin; Wenzhao Tan
Affiliations: Microsoft Research, Redmond, WA, USA (both authors)
Venue: Proceedings of the 20th International Conference on World Wide Web
Year: 2011

Abstract

Accessing online information from various data sources has become a necessary part of everyday life. Unfortunately, such information is not always trustworthy: sources vary widely in quality and often provide inaccurate and conflicting information. Existing approaches attack this problem with unsupervised learning methods, inferring the confidence of each data value and the trustworthiness of each source from one another under the assumption that values provided by more sources are more accurate. However, because false values can spread widely through copying among sources, and out-of-date data often overwhelm up-to-date data, such bootstrapping methods are often ineffective. In this paper we propose a semi-supervised approach that finds true values with the help of ground truth data. Such ground truth data, even in very small amounts, can greatly help identify trustworthy data sources. Unlike existing studies that only provide iterative algorithms, we derive the optimal solution to our problem and provide an iterative algorithm that converges to it. Experiments show that our method achieves higher accuracy than existing approaches and that it can be applied to very large data sets when implemented with MapReduce.
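
The contrast between bootstrapping and a ground-truth-guided approach can be illustrated with a toy sketch. This is not the algorithm derived in the paper, whose details the abstract does not give. It is a simple iterative scheme assumed here for illustration: the function name `discover_truth`, the update rules (trust-weighted voting, trust as mean claim confidence), and the sample data are all hypothetical. Three sources copy an outdated value, one source is correct, and two labeled items are enough to flip the result on the unlabeled item.

```python
from collections import defaultdict


def discover_truth(claims, ground_truth, iterations=20, default_trust=0.5):
    """Toy semi-supervised truth discovery (illustrative assumption only).

    claims: list of (source, item, value) triples.
    ground_truth: dict mapping a few items to their known true values.
    Returns (chosen value per item, trust score per source).
    """
    supporters = defaultdict(set)        # (item, value) -> sources claiming it
    claims_by_source = defaultdict(set)  # source -> set of (item, value)
    for source, item, value in claims:
        supporters[(item, value)].add(source)
        claims_by_source[source].add((item, value))

    trust = {source: default_trust for source in claims_by_source}
    confidence = {}

    for _ in range(iterations):
        # Step 1: confidence of each claimed value.
        item_totals = defaultdict(float)
        for (item, _value), srcs in supporters.items():
            item_totals[item] += sum(trust[s] for s in srcs)
        for (item, value), srcs in supporters.items():
            if item in ground_truth:
                # Labeled items are clamped to the known answer.
                confidence[(item, value)] = 1.0 if value == ground_truth[item] else 0.0
            else:
                # Unlabeled items use trust-weighted voting.
                total = item_totals[item]
                support = sum(trust[s] for s in srcs)
                confidence[(item, value)] = support / total if total > 0 else 0.0

        # Step 2: a source's trust is the mean confidence of its claims.
        for source, facts in claims_by_source.items():
            trust[source] = sum(confidence[f] for f in facts) / len(facts)

    # Pick the highest-confidence value per item; labeled items keep their label.
    best = {}
    for (item, value), conf in confidence.items():
        if item not in best or conf > best[item][1]:
            best[item] = (value, conf)
    result = {item: value for item, (value, _conf) in best.items()}
    result.update(ground_truth)
    return result, trust


# Hypothetical data: A, B, C copy an outdated value; D is up to date.
claims = [(s, item, "old") for s in ("A", "B", "C") for item in ("a", "b", "c")]
claims += [("D", item, "new") for item in ("a", "b", "c")]

unsupervised, _ = discover_truth(claims, ground_truth={})
guided, guided_trust = discover_truth(claims, ground_truth={"a": "new", "b": "new"})

print(unsupervised["c"])  # the copied majority wins: old
print(guided["c"])        # labels expose the copiers: new
```

In this toy setting, voting alone reinforces the copied value on every iteration, while the two labeled items steadily lower the trust of sources A, B, and C until the lone correct source D dominates the unlabeled item.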