We are often thrilled by the abundance of information surrounding us and wish to integrate data from as many sources as possible. However, understanding, analyzing, and using these data is often hard. Too much data can introduce a huge integration cost, such as expenses for purchasing data and resources for integration and cleaning. Furthermore, including low-quality data can even degrade the quality of integration results instead of bringing the desired quality gain. Thus, "the more the better" does not always hold for data integration, and often "less is more". In this paper, we study how to select a subset of sources before integration so that we can balance the quality of the integrated data against the integration cost. Inspired by the Marginalism principle in economic theory, we wish to integrate a new source only if its marginal gain, often a function of improved integration quality, is higher than its marginal cost, which is associated with data-purchase expense and integration resources. As a first step towards this goal, we focus on data fusion tasks, where the goal is to resolve conflicts among different sources. We propose a randomized solution for selecting sources for fusion and empirically show its effectiveness and scalability on both real-world and synthetic data.
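To make the marginalism criterion concrete, here is a minimal sketch of source selection driven by marginal gain versus marginal cost. It is not the randomized solution proposed in the paper. It uses a plain greedy loop as a stand-in, and every name in it (`select_sources`, `quality`, `ACCURACY`, `COSTS`, `VALUE`) is hypothetical. The toy quality model, which assumes independent sources and scores a set by the probability that at least one source is correct, is also an assumption made only for illustration.

```python
import math
from typing import Callable, Dict, FrozenSet, List

# Hypothetical toy data: per-source accuracy and per-source integration cost.
ACCURACY: Dict[str, float] = {"A": 0.9, "B": 0.8, "C": 0.5}
COSTS: Dict[str, float] = {"A": 1.0, "B": 0.5, "C": 1.0}
VALUE = 10.0  # assumed monetary value of a perfectly correct integration result


def quality(sources: FrozenSet[str]) -> float:
    """Toy gain model: value times P(at least one selected source is correct)."""
    return VALUE * (1.0 - math.prod(1.0 - ACCURACY[s] for s in sources))


def select_sources(
    costs: Dict[str, float],
    gain_of: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Greedily add the source with the largest positive (marginal gain - marginal cost)."""
    selected: List[str] = []
    remaining = set(costs)
    while remaining:
        current = gain_of(frozenset(selected))
        best, best_net = None, 0.0
        for s in sorted(remaining):
            marginal_gain = gain_of(frozenset(selected + [s])) - current
            net = marginal_gain - costs[s]
            if net > best_net:
                best, best_net = s, net
        if best is None:  # no remaining source pays for itself: stop integrating
            break
        selected.append(best)
        remaining.remove(best)
    return selected


print(select_sources(COSTS, quality))  # ['A', 'B']
```

In this toy setting source C is left out: once A and B are integrated, C raises the gain by only about 0.1 while costing 1.0, which is the "less is more" effect described above. A greedy loop of this kind can stop at a local optimum, which is one reason to prefer a randomized search over a single greedy pass.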