Schema-as-you-go: on probabilistic tagging and querying of wide tables

Authors:
Meiyu Lu;Divyakant Agrawal;Bing Tian Dai;Anthony K.H. Tung
Affiliations:
National University of Singapore, Singapore, Singapore;University of California at Santa Barbara, Santa Barbara, USA;National University of Singapore, Singapore, Singapore;National University of Singapore, Singapore, Singapore
Venue:
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Year:
2011

Abstract

The emergence of Web 2.0 has produced a huge amount of heterogeneous data contributed by a large number of users, creating new challenges for data management and query processing. Because the data are integrated from various sources and accessed by numerous users, providing users with a single unified mediated schema, as in conventional data integration, is insufficient. On the one hand, a deterministic mediated schema restricts users' freedom to express queries in their preferred vocabulary; on the other hand, it is unrealistic to expect users to remember the numerous attribute names that arise from integrating various data sources. A user-oriented data management and query interface is therefore required. In this paper, we propose an out-of-the-box approach that separates users' actions from database operations. The separating layer addresses these challenges from a semantic perspective. It interprets the semantics of each data value through tags provided by users, and then inserts the value into the database together with these tags. At query time, the layer also serves as a platform for retrieving data by interpreting the semantics of the tags that users query. Experiments illustrate both the effectiveness and the efficiency of our approach.
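
To make the idea of a tagging layer concrete, the following is a minimal sketch and not the authors' system. It assumes a sparse wide table held in memory as a dictionary, and it uses plain string similarity from Python's difflib as a stand-in for the semantic interpretation of tags that the abstract describes. The class name TagLayer, its methods, the 0.6 threshold, and the sample data are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


class TagLayer:
    """Toy layer between users and a sparse wide table (hypothetical names)."""

    def __init__(self):
        # row id -> {tag: value}; tags a row does not use are simply absent
        self.rows = defaultdict(dict)

    def insert(self, row_id, value, tags):
        """Store one value under every tag the user supplied for it."""
        for tag in tags:
            self.rows[row_id][tag.lower()] = value

    def query(self, tag, threshold=0.6):
        """Return (score, row_id, stored_tag, value) tuples, best match first.

        The score is a string-similarity ratio in [0, 1]. It is an assumed
        placeholder for a real estimate of how likely two tags are to mean
        the same thing.
        """
        tag = tag.lower()
        hits = []
        for row_id, cells in self.rows.items():
            for stored_tag, value in cells.items():
                score = SequenceMatcher(None, tag, stored_tag).ratio()
                if score >= threshold:
                    hits.append((score, row_id, stored_tag, value))
        return sorted(hits, key=lambda h: h[0], reverse=True)


# Illustrative data only: users tag values in their own vocabulary.
layer = TagLayer()
layer.insert(1, "Canon EOS 550D", ["camera", "model"])
layer.insert(2, "Nikon D90", ["cameras"])
layer.insert(2, 899, ["price"])

# A query for "camera" also reaches the value tagged "cameras".
results = layer.query("camera")
for score, row_id, stored_tag, value in results:
    print(f"{score:.2f} row={row_id} tag={stored_tag} value={value}")
```

In this sketch, users never see a fixed attribute list: they insert and query with whatever tags they prefer, and the layer ranks stored values by how closely their tags match the queried one. A realistic system would replace the string-similarity score with a measure of semantic relatedness between tags.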