Actively soliciting feedback for query answers in keyword search-based data integration

Authors:
Zhepeng Yan;Nan Zheng;Zachary G. Ives;Partha Pratim Talukdar;Cong Yu
Affiliations:
University of Pennsylvania, Philadelphia, PA;University of Pennsylvania, Philadelphia, PA;University of Pennsylvania, Philadelphia, PA;Carnegie Mellon University, Pittsburgh, PA;Google, Inc., New York, NY
Venue:
Proceedings of the VLDB Endowment
Year:
2013

Abstract

Scaling up data integration so that new sources can be used as soon as they are discovered remains an open problem: global schemas for integrated data are difficult to develop and expand, and schema and record matching techniques are limited because data and metadata are often under-specified and must be disambiguated by data experts. One promising approach is to avoid a global schema altogether and instead to develop keyword search-based data integration, in which the system lazily discovers associations that let it join together matches to keywords and return ranked results. The user is expected to understand the data domain and to provide feedback about the quality of answers. The system generalizes such feedback to learn how to integrate data correctly. A major open challenge is that, under this model, the user sees and offers feedback on only a few "top-k" results: this result set must be carefully selected to include both answers of high relevance and answers that are highly informative when feedback is given on them. Existing systems focus only on predicting relevance, by composing the scores of various schema and record matching algorithms. In this paper we show how to predict the uncertainty associated with a query result's score, as well as how informative feedback on a given result is. We build upon these foundations to develop an active learning approach to keyword search-based data integration, and we validate the effectiveness of our solution over real data from several very different domains.
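
The trade-off the abstract describes, a top-k list that mixes highly relevant answers with answers whose feedback would be informative, can be illustrated with a minimal sketch. This is not the paper's method: the names `Candidate`, `select_top_k`, and `explore_slots` are hypothetical, and using the uncertainty of the predicted score as the measure of informativeness is a simplifying assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A hypothetical query answer with a predicted score and its uncertainty."""
    answer_id: str
    relevance: float    # predicted relevance score (higher is better)
    uncertainty: float  # assumed proxy for informativeness, e.g. score std. dev.


def select_top_k(candidates: List[Candidate], k: int, explore_slots: int) -> List[Candidate]:
    """Fill k result slots: the most relevant answers first, then the most
    uncertain of the remaining answers (assumed to be the most informative)."""
    if not 0 <= explore_slots <= k:
        raise ValueError("explore_slots must be between 0 and k")
    by_relevance = sorted(candidates, key=lambda c: c.relevance, reverse=True)
    split = k - explore_slots
    exploit = by_relevance[:split]      # slots reserved for high relevance
    remaining = by_relevance[split:]    # everything not yet shown
    explore = sorted(remaining, key=lambda c: c.uncertainty, reverse=True)[:explore_slots]
    return exploit + explore


# Illustrative, made-up candidates; the numbers carry no empirical meaning.
candidates = [
    Candidate("a", relevance=0.9, uncertainty=0.05),
    Candidate("b", relevance=0.8, uncertainty=0.10),
    Candidate("c", relevance=0.5, uncertainty=0.05),
    Candidate("d", relevance=0.4, uncertainty=0.02),
    Candidate("e", relevance=0.3, uncertainty=0.30),
]
shown = select_top_k(candidates, k=3, explore_slots=1)
print([c.answer_id for c in shown])
```

Reserving a fixed number of slots keeps the sketch easy to follow; a system that scores each answer by a combined relevance and informativeness value would be an equally plausible design under the same assumptions.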