An Information-Theoretic Analysis of Relational Databases, Part I: Data Dependencies and Information Metric

  • Authors: Tony T. Lee
  • Affiliation: Bell Communications Research, Morristown, NJ
  • Venue: IEEE Transactions on Software Engineering
  • Year: 1987

Abstract

Database design is based on the concept of data dependency: the interrelationship between the data contained in various sets of attributes. In particular, functional, multivalued, and acyclic join dependencies play an essential role in the design of database schemas. This paper discusses the basic definition of an information metric and how this notion can be used in relational databases. We use Shannon entropy as an information metric to quantify the information associated with a set of attributes, and we prove that data dependencies can be formulated in terms of entropies. These formulas make the numerical computation and testing of data dependencies feasible. Among the different types of data dependencies, the acyclic join dependency is the most important to the design of a relational database schema. The acyclic join dependency, with multivalued dependency as a special case, imposes a constraint on the information-preserving decomposition of a relation. Interestingly, this constraint on a relation is similar to Gibbs' condition for separating physical systems in statistical mechanics: both assert that entropy is preserved during the decomposition process, that is, that the entropies of the corresponding sets of attributes must satisfy the inclusion-exclusion identity.
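To illustrate the kind of numerical test the abstract refers to, here is a minimal Python sketch, assuming the standard entropy characterizations of dependencies: each tuple of a relation is weighted uniformly, the entropy H(X) of an attribute set X is the Shannon entropy of the induced distribution on its projection, a functional dependency X → Y corresponds to H(XY) = H(X) (i.e., H(Y|X) = 0), and a multivalued dependency X →→ Y, with Z the remaining attributes, corresponds to the inclusion-exclusion identity H(XYZ) + H(X) = H(XY) + H(XZ). The helper names, tolerance, and toy relation below are illustrative, not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(relation, attrs):
    """Shannon entropy H(attrs): each tuple is equally likely, and we take
    the entropy of the induced distribution on the projection onto attrs."""
    n = len(relation)
    counts = Counter(tuple(t[a] for a in attrs) for t in relation)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def holds_fd(relation, x, y, eps=1e-9):
    """Functional dependency X -> Y holds iff H(XY) = H(X), i.e. H(Y|X) = 0."""
    return abs(entropy(relation, x + y) - entropy(relation, x)) < eps


def holds_mvd(relation, x, y, z, eps=1e-9):
    """Multivalued dependency X ->> Y (with Z the remaining attributes)
    holds iff H(XYZ) + H(X) = H(XY) + H(XZ): the inclusion-exclusion
    identity for the lossless decomposition of XYZ into XY and XZ."""
    lhs = entropy(relation, x + y + z) + entropy(relation, x)
    rhs = entropy(relation, x + y) + entropy(relation, x + z)
    return abs(lhs - rhs) < eps


# Classic course/teacher/book relation: course ->> teacher holds, so the
# decomposition into (course, teacher) and (course, book) preserves entropy.
r = [
    {"course": "DB", "teacher": "Smith", "book": "Ullman"},
    {"course": "DB", "teacher": "Smith", "book": "Date"},
    {"course": "DB", "teacher": "Jones", "book": "Ullman"},
    {"course": "DB", "teacher": "Jones", "book": "Date"},
]
print(holds_mvd(r, ("course",), ("teacher",), ("book",)))  # True
print(holds_fd(r, ("book",), ("course",)))                 # True: book -> course
```

In this toy relation the entropies come out as H(course) = 0, H(course, teacher) = H(course, book) = 1 bit, and H(course, teacher, book) = 2 bits, so the inclusion-exclusion identity 2 + 0 = 1 + 1 holds, which is the entropy-preservation condition the abstract compares to Gibbs' separation condition.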