Schema extraction

  • Authors: Divesh Srivastava
  • Affiliations: AT&T Labs-Research, Florham Park, NJ, USA
  • Venue: CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
  • Year: 2010

Abstract

Understanding the schema of a complex database is a crucial step in exploratory data analysis. However, gaining such an understanding is challenging for new users, for two main reasons. First, complex databases often have thousands of inter-linked tables, with little indication of which tables are important or what the main concepts in the database schema are. Second, schemas can be inaccurate: some foreign/primary key relationships are not known to the designers but are inherent in the data, while others become invalid due to data inconsistencies. In this talk, we present an approach that addresses these challenges and automatically extracts an understandable schema from a complex database. The first step in our approach is a robust algorithm to discover foreign/primary key relationships between tables. We present a general rule, termed Randomness, that subsumes a variety of other rules proposed in previous work, and develop efficient approximation algorithms for evaluating Randomness using only two passes over the data. The second step is a principled approach to summarizing the schema, which consists of tables linked by foreign/primary keys, so that a user can easily identify the main concepts and important tables. We present an information-theoretic approach to identifying important tables, and an intuitive notion of table similarity that can be used to cluster tables into the main concepts of the schema. We validate our approach using real and synthetic datasets. This is based on joint work [1, 2] with Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M. Procopiuc, Xiaoyan Yang and Meihui Zhang.
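
To make the two steps more concrete, here is a minimal Python sketch of the kind of signals involved. It is not the algorithm presented in the talk: the inclusion test below is only a necessary condition for a foreign/primary key relationship and does not implement the Randomness rule, and plain column entropy is used merely as a stand-in for an information-theoretic measure. The function names and the toy columns (`customers_id`, `orders_customer_id`, `orders_status`) are hypothetical.

```python
import math
from collections import Counter


def is_inclusion_candidate(fk_values, pk_values):
    """Return True when pk_values are unique and every non-null
    fk value appears among them (a necessary condition only)."""
    pk_list = list(pk_values)
    pk_set = set(pk_list)
    if len(pk_set) != len(pk_list):
        return False  # duplicate values: the column cannot be a key
    return all(v in pk_set for v in fk_values if v is not None)


def column_entropy(values):
    """Shannon entropy, in bits, of one column's empirical value distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy columns, used only for illustration.
customers_id = [1, 2, 3, 4]
orders_customer_id = [1, 1, 3, 4, 4, 4]
orders_status = ["open", "open", "closed", "closed"]

print(is_inclusion_candidate(orders_customer_id, customers_id))
print(column_entropy(orders_status))
```

Many column pairs pass an inclusion test by coincidence (for example, any two small integer ranges), which is why a stricter rule such as the one described in the abstract is needed to separate true relationships from spurious ones.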