Learning to create data-integrating queries

Authors:
Partha Pratim Talukdar;Marie Jacob;Muhammad Salman Mehmood;Koby Crammer;Zachary G. Ives;Fernando Pereira;Sudipto Guha
Affiliations:
University of Pennsylvania, Philadelphia, PA;University of Pennsylvania, Philadelphia, PA;University of Pennsylvania, Philadelphia, PA;University of Pennsylvania, Philadelphia, PA;University of Pennsylvania, Philadelphia, PA;University of Pennsylvania, Philadelphia, PA;University of Pennsylvania, Philadelphia, PA
Venue:
Proceedings of the VLDB Endowment
Year:
2008


Abstract

The number of potentially related data resources available for querying (databases, data warehouses, virtual integrated schemas) continues to grow rapidly. Perhaps no area has seen this problem as acutely as the life sciences, where hundreds of large, complex, interlinked data resources are available in fields like proteomics, genomics, disease studies, and pharmacology. The schemas of individual databases are often large on their own, but users also need to pose queries across multiple sources, exploiting foreign keys and schema mappings. Since the users are not experts, they typically rely on pre-defined Web forms and associated query templates, developed by programmers to meet particular scientists' needs. Unfortunately, such forms are scarce commodities, often limited to a single database, and mismatched with biologists' information needs, which are often context-sensitive and span multiple databases. We present a system with which a non-expert user can author new query templates and Web forms, to be reused by anyone with related information needs. The user poses keyword queries that are matched against source relations and their attributes; the system uses sequences of associations (e.g., foreign keys, links, schema mappings, synonyms, and taxonomies) to create multiple ranked queries linking the matches to keywords; the set of queries is attached to a Web query form. The user and his or her associates may then pose specific queries by filling in parameters in the form. Importantly, the answers to these queries are ranked and annotated with data provenance, and the user provides feedback on the utility of the answers. From this feedback the system learns to assign costs to sources and associations according to the user's specific information need, which in turn changes the ranking of the queries used to generate results. We evaluate the effectiveness of our method against "gold standard" costs from domain experts and demonstrate the method's scalability.
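
The ranking-and-feedback loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a toy schema graph whose relation names and edge costs are invented, it treats a candidate query as a simple path between two keyword-matched relations (the general case links several matches with a tree), and it uses a margin-based cost update in the spirit of online passive-aggressive learning, which the abstract does not name as the method. The functions `rank_queries` and `feedback_update` are hypothetical names introduced here.

```python
# Hypothetical schema graph: nodes are relations, edges are associations
# (foreign keys, links, mappings) with a non-negative cost. All names and
# costs are invented for illustration.
EDGES = {
    ("gene", "protein"): 1.0,
    ("protein", "disease"): 1.0,
    ("gene", "publication"): 0.5,
    ("publication", "disease"): 0.5,
}


def neighbors(costs, node):
    """Yield (other_node, edge_key) for every undirected edge touching node."""
    for (a, b) in costs:
        if a == node:
            yield b, (a, b)
        elif b == node:
            yield a, (a, b)


def simple_paths(costs, source, target, path=None, used=None):
    """Enumerate simple paths as lists of edge keys (only viable on tiny graphs)."""
    path = path or []
    used = used or {source}
    if source == target:
        yield list(path)
        return
    for nxt, edge in neighbors(costs, source):
        if nxt not in used:
            yield from simple_paths(costs, nxt, target, path + [edge], used | {nxt})


def rank_queries(costs, source, target):
    """Return candidate join paths sorted by total cost, cheapest first."""
    paths = simple_paths(costs, source, target)
    return sorted(paths, key=lambda p: (sum(costs[e] for e in p), p))


def feedback_update(costs, preferred, rejected, margin=0.1):
    """Return new costs under which `preferred` beats `rejected` by `margin`.

    The required change is spread evenly over the edges the two paths do not
    share. Costs are clamped at zero, so the margin can fall short if an edge
    on the preferred path is already very cheap.
    """
    only_pref = set(preferred) - set(rejected)
    only_rej = set(rejected) - set(preferred)
    loss = margin + sum(costs[e] for e in preferred) - sum(costs[e] for e in rejected)
    n = len(only_pref) + len(only_rej)
    if loss <= 0 or n == 0:
        return dict(costs)  # constraint already satisfied, or nothing to adjust
    step = loss / n
    updated = dict(costs)
    for e in only_pref:
        updated[e] = max(0.0, updated[e] - step)
    for e in only_rej:
        updated[e] += step
    return updated


if __name__ == "__main__":
    ranked = rank_queries(EDGES, "gene", "disease")
    print("before feedback:", ranked[0])
    # Suppose the user marks answers from the second-ranked query as more useful.
    learned = feedback_update(EDGES, preferred=ranked[1], rejected=ranked[0])
    print("after feedback: ", rank_queries(learned, "gene", "disease")[0])
```

With these invented costs, the path through `publication` is cheapest at first. After one round of feedback favoring the path through `protein`, the two paths swap places in the ranking. A real system would need a scalable search over the graph rather than exhaustive path enumeration.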