Summarizing relational databases

Authors:
Xiaoyan Yang; Cecilia M. Procopiuc; Divesh Srivastava
Affiliations:
National Univ. of Singapore, Republic of Singapore; AT&T Labs-Research, Florham Park, NJ; AT&T Labs-Research, Florham Park, NJ
Venue:
Proceedings of the VLDB Endowment
Year:
2009

Abstract

Complex databases are challenging for users unfamiliar with their schemas to explore and query. Enterprise databases often have hundreds of interlinked tables, so even when extensive documentation is available, new users must spend considerable time understanding the schema before they can retrieve any information from the database. The problem is aggravated if the documentation is missing or outdated, which may happen with legacy databases. In this paper we identify limitations of previous approaches to this problem, and propose a principled approach to summarizing the contents of a relational database, so that a user can determine at a glance the type of information it contains and the main tables in which that information resides. Our approach has three components. First, we define the importance of each table in the database as its stable state value in a random walk over the schema graph, where the transition probabilities depend on the entropies of table attributes. This ensures that the importance of a table depends both on its information content and on how that content relates to the content of other tables in the database. Second, we define a metric space over the tables in a database, such that the distance function is consistent with an intuitive notion of table similarity. Finally, we use a Weighted k-Center algorithm under this distance function to cluster all tables in the database around the most relevant tables, and return the result as our summary. We conduct an extensive experimental study on a benchmark database, comparing our approach with previous methods as well as with several hybrid models. We show that our approach not only achieves significantly higher accuracy than the previous state of the art, but is also faster and scales linearly with the size of the schema graph.
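The three components described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the exact transition probabilities nor the distance function, so the edge weights in `toy_weights` are hand-picked placeholders standing in for entropy-derived values, `toy_dist` is a hypothetical hop-count metric, and the greedy center selection is one common weighted k-center heuristic assumed here for illustration. All function and table names are illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of a column."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def stationary_importance(weights, iters=1000, tol=1e-12):
    """Stationary distribution of a random walk, found by power iteration.

    weights[u][v] is a non-negative, unnormalised weight for stepping u -> v.
    """
    nodes = list(weights)
    # Row-normalise the weights into transition probabilities.
    P = {}
    for u in nodes:
        total = sum(weights[u].values())
        P[u] = {v: w / total for v, w in weights[u].items()}
    pi = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        nxt = dict.fromkeys(nodes, 0.0)
        for u in nodes:
            for v, p in P[u].items():
                nxt[v] += pi[u] * p
        done = max(abs(nxt[u] - pi[u]) for u in nodes) < tol
        pi = nxt
        if done:
            break
    return pi


def weighted_k_center(importance, dist, k):
    """Greedy heuristic (an assumption, not taken from the paper's text):
    start from the most important table, then repeatedly add the table whose
    importance-weighted distance to its nearest center is largest."""
    nodes = list(importance)
    centers = [max(nodes, key=lambda t: importance[t])]
    while len(centers) < k:
        def weighted_gap(t):
            return importance[t] * min(dist(t, c) for c in centers)
        rest = [t for t in nodes if t not in centers]
        centers.append(max(rest, key=weighted_gap))
    # Assign every table to its nearest center.
    assignment = {t: min(centers, key=lambda c: dist(t, c)) for t in nodes}
    return centers, assignment


# Hypothetical four-table schema; self-loops keep the walk aperiodic.
toy_weights = {
    "orders":   {"orders": 2.0, "customer": 1.0, "lineitem": 1.0},
    "customer": {"customer": 1.0, "orders": 1.0},
    "lineitem": {"lineitem": 1.0, "orders": 1.0, "part": 1.0},
    "part":     {"part": 1.0, "lineitem": 1.0},
}
# Hypothetical metric: hop count along customer - orders - lineitem - part.
position = {"customer": 0, "orders": 1, "lineitem": 2, "part": 3}


def toy_dist(a, b):
    return abs(position[a] - position[b])


importance = stationary_importance(toy_weights)
centers, assignment = weighted_k_center(importance, toy_dist, k=2)
print(centers)
print(assignment)
```

Weighting the distance by importance means a moderately distant but important table can become a center ahead of a far-away but unimportant one, which matches the abstract's goal of clustering tables around the most relevant ones.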