Bootstrapping pay-as-you-go data integration systems

Authors:
Anish Das Sarma; Xin Dong; Alon Halevy
Affiliations:
Stanford University, Stanford, CA, USA; AT&T Labs-Research, New Jersey, NJ, USA; Google, Mountain View, CA, USA
Venue:
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Year:
2008

Abstract

Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant upfront effort to create a mediated schema and semantic mappings from the data sources to that schema. Many application contexts involving multiple data sources (e.g., the web, personal information management, enterprise intranets) do not require full integration in order to provide useful services, which motivates a pay-as-you-go approach to integration. With this approach, a system starts with very few (or inaccurate) semantic mappings, and these mappings are improved over time as deemed necessary. This paper describes the first completely self-configuring data integration system. The goal of our work is to investigate how advanced a starting point we can provide for a pay-as-you-go system. Our system is based on the new concept of a probabilistic mediated schema that is automatically created from the data sources. We also automatically create probabilistic schema mappings between the sources and the mediated schema. We describe experiments in multiple domains, with 50 to 800 data sources, and show that our system is able to produce high-quality answers with no human intervention.
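
The abstract does not spell out what a probabilistic mediated schema looks like. As a rough illustration only, the sketch below assumes one plausible reading: a set of candidate mediated schemas, each a clustering of source attributes, with probabilities that sum to 1. The attribute names, the probabilities, and both helper functions are hypothetical and are not taken from the paper's system.

```python
# Assumption: each candidate mediated schema is a clustering of source
# attributes; attributes in one cluster form a single mediated attribute.
# Attribute names and probabilities below are made up for illustration.
candidate_schemas = [
    # (clusters, probability)
    ([{"name"}, {"phone", "hPhone"}, {"oPhone"}, {"address", "hAddr"}], 0.5),
    ([{"name"}, {"phone", "oPhone"}, {"hPhone"}, {"address", "hAddr"}], 0.5),
]


def check_distribution(schemas, tol=1e-9):
    """Raise ValueError unless the schema probabilities sum to 1."""
    total = sum(prob for _, prob in schemas)
    if abs(total - 1.0) > tol:
        raise ValueError(f"probabilities sum to {total}, expected 1.0")


def same_cluster_probability(schemas, attr_a, attr_b):
    """Total probability of the schemas that cluster the two attributes together."""
    return sum(
        prob
        for clusters, prob in schemas
        if any(attr_a in cluster and attr_b in cluster for cluster in clusters)
    )


check_distribution(candidate_schemas)
print(same_cluster_probability(candidate_schemas, "phone", "hPhone"))
print(same_cluster_probability(candidate_schemas, "address", "hAddr"))
```

In this toy setup, keeping both clusterings records the uncertainty about whether `phone` means a home or an office number instead of forcing an early choice, which is in the spirit of starting from imperfect mappings and refining them over time.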