Automatically incorporating new sources in keyword search-based data integration

Authors:
Partha Pratim Talukdar; Zachary G. Ives; Fernando Pereira
Affiliations:
University of Pennsylvania, Philadelphia, PA, USA; University of Pennsylvania, Philadelphia, PA, USA; Google, Inc., Mountain View, CA, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Citing 31
Cited 10

Integration of heterogeneous databases without common domains using queries based on textual similarity
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data

The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7

Reconciling schemas of disparate data sources: a machine-learning approach
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data

Generic Schema Matching with Cupid
Proceedings of the 27th International Conference on Very Large Data Bases

A survey of approaches to automatic schema matching
The VLDB Journal — The International Journal on Very Large Data Bases

Text joins in an RDBMS for web data integration
WWW '03 Proceedings of the 12th international conference on World Wide Web

Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching
ICDE '02 Proceedings of the 18th International Conference on Data Engineering

Keyword Searching and Browsing in Databases using BANKS
ICDE '02 Proceedings of the 18th International Conference on Data Engineering

Evaluating top-k queries over web-accessible databases
ACM Transactions on Database Systems (TODS)

Understanding Web query interfaces: best-effort parsing with hidden syntax
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data

RankSQL: query algebra and optimization for relational top-k queries
Proceedings of the 2005 ACM SIGMOD international conference on Management of data

Bidirectional expansion for keyword search on graph databases
VLDB '05 Proceedings of the 31st international conference on Very large data bases

From databases to dataspaces: a new abstraction for information management
ACM SIGMOD Record

Contextual search and name disambiguation in email using graphs
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Matching large schemas: Approaches and evaluation
Information Systems

BLINKS: ranked keyword searches on graphs
Proceedings of the 2007 ACM SIGMOD international conference on Management of data

Online Passive-Aggressive Algorithms
The Journal of Machine Learning Research

Discover: keyword search in relational databases
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases

ObjectRank: authority-based keyword search in databases
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

Data integration with uncertainty
VLDB '07 Proceedings of the 33rd international conference on Very large data bases

Video suggestion and discovery for YouTube: taking random walks through the view graph
Proceedings of the 17th international conference on World Wide Web

Pay-as-you-go user feedback for dataspace systems
Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Toward best-effort information extraction
Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Building Mashups by example
Proceedings of the 13th international conference on Intelligent user interfaces

Fine-grained relevance feedback for XML retrieval
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

WebTables: exploring the power of tables on the web
Proceedings of the VLDB Endowment

Learning to create data-integrating queries
Proceedings of the VLDB Endowment

Google's Deep Web crawl
Proceedings of the VLDB Endowment

STAR: Steiner-Tree Approximation in Relationship Graphs
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering

Efficiently incorporating user feedback into information extraction and integration programs
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

Weakly-supervised acquisition of labeled class instances using graph random walks
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing

Sharing work in keyword search over databases
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Layered graph data model for data management of dataspace support platform
WAIM'11 Proceedings of the 12th international conference on Web-age information management

DSToolkit: an architecture for flexible dataspace management
Transactions on Large-Scale Data- and Knowledge-Centered Systems V

Sample-driven schema mapping
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

Feedback-based data set recommendation for building linked data applications
Proceedings of the 8th International Conference on Semantic Systems

Collectively representing semi-structured data from the web
AKBC-WEKEX '12 Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction

Using information quality for the identification of relevant web data sources: a proposal
Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services

Actively soliciting feedback for query answers in keyword search-based data integration
Proceedings of the VLDB Endowment

Incrementally improving dataspaces based on user feedback
Information Systems

Big data challenge: a data management perspective
Frontiers of Computer Science: Selected Publications from Chinese Universities

Abstract

Scientific data offers some of the most interesting challenges in data integration today. Scientific fields evolve rapidly and accumulate masses of observational and experimental data that needs to be annotated, revised, interlinked, and made available to other scientists. From the perspective of the user, this can be a major headache, as the data they seek may initially be spread across many databases in need of integration. Worse, even if users are given a solution that integrates the current state of the source databases, new data sources appear with new data items of interest to the user. Here we build upon recent ideas for creating integrated views over data sources using keyword search techniques, ranked answers, and user feedback [32] to investigate how to automatically discover when a new data source has content relevant to a user's view, in essence performing automatic data integration for incoming data sets. The new architecture accommodates a variety of methods to discover related attributes, including label propagation algorithms from the machine learning community [2] and existing schema matchers [11]. The user may provide feedback on the suggested new results, helping the system repair any bad alignments or increase the cost of including a new source that is not useful. We evaluate our approach on actual bioinformatics schemas and data, using state-of-the-art schema matchers as components. We also discuss how our architecture can be adapted to more traditional settings with a mediated schema.
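
As a rough illustration of the label propagation idea mentioned in the abstract, the Python sketch below spreads attribute labels from an existing schema to the attributes of a newly arrived source through the data values they share. It is a generic neighbor-averaging scheme and not the algorithm evaluated in the paper; the attribute names, the sample values, and the helpers `propagate_labels` and `suggest_alignment` are all hypothetical.

```python
from collections import defaultdict


def propagate_labels(edges, seeds, iterations=20):
    """Generic iterative label propagation on an undirected graph.

    edges: iterable of (node, node) pairs, here attribute-to-value links.
    seeds: dict mapping a seed node to its label; seed scores stay clamped.
    Returns a dict mapping each node to {label: score}.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Seed nodes start with full confidence in their own label.
    scores = {
        node: ({seeds[node]: 1.0} if node in seeds else {})
        for node in neighbors
    }

    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                updated[node] = {seeds[node]: 1.0}
                continue
            # A non-seed node takes the average of its neighbors' scores.
            total = defaultdict(float)
            for nbr in nbrs:
                for label, score in scores[nbr].items():
                    total[label] += score / len(nbrs)
            updated[node] = dict(total)
        scores = updated
    return scores


def suggest_alignment(scores, attribute):
    """Return the highest-scoring label for an attribute, or None."""
    candidates = scores.get(attribute, {})
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Hypothetical data: "old.*" attributes belong to the integrated view,
# "new.*" attributes belong to an incoming source, "v:*" are data values.
edges = [
    ("old.gene_name", "v:BRCA1"),
    ("old.gene_name", "v:TP53"),
    ("old.organism", "v:human"),
    ("old.organism", "v:mouse"),
    ("new.symbol", "v:BRCA1"),
    ("new.symbol", "v:TP53"),
    ("new.species", "v:human"),
]
seeds = {"old.gene_name": "gene_name", "old.organism": "organism"}

scores = propagate_labels(edges, seeds)
print(suggest_alignment(scores, "new.symbol"))
print(suggest_alignment(scores, "new.species"))
```

In this toy setting a suggested alignment is simply the label with the highest propagated score. User feedback of the kind the abstract describes could, under the same assumptions, be modeled by removing or down-weighting edges that led to a rejected suggestion.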