Scaling up data integration so that new sources can be used quickly as they are discovered remains an elusive goal: global schemas for integrated data are difficult to develop and expand, and schema and record matching techniques are limited by the fact that data and metadata are often under-specified and must be disambiguated by data experts. One promising approach is to avoid using a global schema and instead to develop keyword search-based data integration, in which the system lazily discovers associations that enable it to join together matches to keywords and return ranked results. The user is expected to understand the data domain and to provide feedback about the quality of answers. The system generalizes such feedback to learn how to integrate data correctly. A major open challenge is that, under this model, the user sees and offers feedback on only a few "top-k" results: this result set must be carefully selected to include both answers of high relevance and answers that are highly informative when feedback is given on them. Existing systems focus only on predicting relevance, by composing the scores of various schema and record matching algorithms. In this paper we show how to predict the uncertainty associated with a query result's score, as well as how informative feedback is on a given result. We build upon these foundations to develop an active learning approach to keyword search-based data integration, and we validate the effectiveness of our solution over real data from several very different domains.