Sequential dependencies

  • Authors: Lukasz Golab; Howard Karloff; Flip Korn; Avishek Saha; Divesh Srivastava

  • Affiliations: AT&T Labs--Research; AT&T Labs--Research; AT&T Labs--Research; University of Utah; AT&T Labs--Research

  • Venue: Proceedings of the VLDB Endowment
  • Year: 2009

Abstract

We study sequential dependencies that express the semantics of data with ordered domains and help identify quality problems with such data. Given an interval g, we write X →_g Y to denote that the difference between the Y-attribute values of any two consecutive records, when sorted on X, must be in g. For example, time →_(0,∞) sequence_number indicates that sequence numbers are strictly increasing over time, whereas sequence_number →_[4,5] time means that the time "gaps" between consecutive sequence numbers are between 4 and 5. Sequential dependencies express relationships between ordered attributes, and identify missing (gaps too large), extraneous (gaps too small) and out-of-order data. To make sequential dependencies applicable to real-world data, we relax their requirements and allow them to hold approximately (with some exceptions) and conditionally (on various subsets of the data). This paper proposes the notion of conditional approximate sequential dependencies and provides an efficient framework for discovering pattern tableaux, which are compact representations of the subsets of the data (i.e., ranges of values of the ordered attributes) that satisfy the underlying dependency. We present analyses of our proposed algorithms, and experiments on real data demonstrating the efficiency and utility of our framework.
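
The basic definition in the abstract (sort on X, then require every consecutive difference in Y to lie in the interval g) can be illustrated with a minimal sketch. This is only a check of the exact dependency on a small in-memory list; it is not the paper's tableau discovery algorithm, and it does not implement the approximate or conditional variants. The function name `sd_violations`, its parameters, and the labels `"too_small"` and `"too_large"` are hypothetical, chosen for illustration.

```python
import math

def sd_violations(records, lo, hi, lo_open=False, hi_open=False):
    """List consecutive pairs that violate a sequential dependency.

    records: iterable of (x, y) pairs (hypothetical input layout).
    lo, hi:  endpoints of the gap interval g.
    lo_open, hi_open: True if the corresponding endpoint is excluded.
    Returns tuples (x_prev, x_next, gap, kind).
    """
    # Sort on the X attribute; Python's sort is stable, so ties keep input order.
    ordered = sorted(records, key=lambda r: r[0])
    violations = []
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        gap = y1 - y0
        too_small = gap < lo or (lo_open and gap == lo)
        too_large = gap > hi or (hi_open and gap == hi)
        if too_small:
            # Gap below the interval: extraneous or out-of-order data.
            violations.append((x0, x1, gap, "too_small"))
        elif too_large:
            # Gap above the interval: possibly missing data.
            violations.append((x0, x1, gap, "too_large"))
    return violations

# sequence_number ->[4,5] time, with made-up (sequence_number, time) pairs.
polls = [(1, 0), (2, 4), (3, 9), (4, 20), (5, 22)]
gap_report = sd_violations(polls, 4, 5)

# time ->(0,inf) sequence_number, with made-up (time, sequence_number) pairs.
events = [(0, 1), (1, 2), (2, 2), (3, 5)]
order_report = sd_violations(events, 0, math.inf, lo_open=True, hi_open=True)
```

On this made-up data, `gap_report` flags the pair of sequence numbers 3 and 4 (gap 11, too large) and the pair 4 and 5 (gap 2, too small), while `order_report` flags the repeated sequence number between times 1 and 2, since a gap of 0 falls outside the open interval (0,∞).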