Less is more: selecting sources wisely for integration

Authors:
Xin Luna Dong;Barna Saha;Divesh Srivastava
Affiliations:
AT&T Labs-Research;AT&T Labs-Research;AT&T Labs-Research
Venue:
Proceedings of the VLDB Endowment
Year:
2012



Abstract

We are often thrilled by the abundance of information surrounding us and wish to integrate data from as many sources as possible. However, understanding, analyzing, and using these data are often hard. Too much data can introduce a huge integration cost, such as expenses for purchasing data and resources for integration and cleaning. Furthermore, including low-quality data can even deteriorate the quality of integration results instead of bringing the desired quality gain. Thus, "the more the better" does not always hold for data integration and often "less is more". In this paper, we study how to select a subset of sources before integration such that we can balance the quality of integrated data and integration cost. Inspired by the Marginalism principle in economic theory, we wish to integrate a new source only if its marginal gain, often a function of improved integration quality, is higher than the marginal cost, associated with data-purchase expense and integration resources. As a first step towards this goal, we focus on data fusion tasks, where the goal is to resolve conflicts from different sources. We propose a randomized solution for selecting sources for fusion and show empirically its effectiveness and scalability on both real-world data and synthetic data.
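The marginalism rule described in the abstract can be illustrated with a small sketch. This is not the randomized algorithm the paper proposes; it is a plain greedy loop, shown only to make the "marginal gain versus marginal cost" test concrete. The names `select_sources`, `toy_gain`, `toy_cost`, `ACCURACY`, and `COST` are hypothetical, and the quality model (probability that at least one selected source is correct, assuming independent sources) and the numbers are illustrative assumptions, not taken from the paper.

```python
import math
from typing import Callable, FrozenSet, Iterable

Gain = Callable[[FrozenSet[str]], float]
Cost = Callable[[FrozenSet[str]], float]


def select_sources(candidates: Iterable[str], gain: Gain, cost: Cost) -> FrozenSet[str]:
    """Greedy sketch: keep adding the source with the largest positive
    (marginal gain - marginal cost); stop when no source has a positive margin."""
    selected: FrozenSet[str] = frozenset()
    remaining = set(candidates)
    while remaining:
        best, best_profit = None, 0.0
        for source in sorted(remaining):  # sorted for deterministic tie-breaking
            extended = selected | {source}
            marginal_gain = gain(extended) - gain(selected)
            marginal_cost = cost(extended) - cost(selected)
            profit = marginal_gain - marginal_cost
            if profit > best_profit:
                best, best_profit = source, profit
        if best is None:  # every remaining source costs more than it adds
            break
        selected = selected | {best}
        remaining.discard(best)
    return selected


# Hypothetical per-source accuracy and cost, on a common scale (assumed values).
ACCURACY = {"A": 0.9, "B": 0.6, "C": 0.3}
COST = {"A": 0.2, "B": 0.1, "C": 0.1}


def toy_gain(sources: FrozenSet[str]) -> float:
    # Assumed quality model: chance that at least one selected source is correct,
    # treating sources as independent.
    return 1.0 - math.prod(1.0 - ACCURACY[s] for s in sources)


def toy_cost(sources: FrozenSet[str]) -> float:
    # Assumed cost model: costs simply add up.
    return sum(COST[s] for s in sources)


chosen = select_sources(ACCURACY, toy_gain, toy_cost)
chosen_if_free = select_sources(ACCURACY, toy_gain, lambda sources: 0.0)
```

With these assumed numbers, only source A is selected: once A is in, B and C each improve the toy quality score by less than they cost. When cost is zero, every source is added, which is the "the more the better" case that the abstract argues does not hold in general.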