Fragments of order

  • Authors:
  • Aristides Gionis; Teija Kujala; Heikki Mannila

  • Affiliations:
  • Stanford University, Stanford, CA; University of Helsinki, P.O. Box 26, Teollisuuskatu 23, Helsinki, Finland; University of Helsinki, P.O. Box 26, Teollisuuskatu 23, Helsinki, Finland

  • Venue:
  • Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2003

Abstract

High-dimensional collections of 0-1 data occur in many applications. The attributes in such data sets are typically considered to be unordered. However, in many cases there is a natural total or partial order ≺ underlying the variables of the data set. Examples of variables for which such orders exist include terms in documents, courses in enrollment data, and paleontological sites in fossil data collections. The observations in such applications are flat, unordered sets; however, the data sets respect the underlying ordering of the variables. By this we mean that if A ≺ B ≺ C are three variables respecting the underlying ordering ≺, and both A and C appear in an observation, then, up to noise levels, B also appears in this observation. Similarly, if A_1 ≺ A_2 ≺ … ≺ A_{l-1} ≺ A_l is a longer sequence of variables, we do not expect to see many observations for which there are indices i < j < k such that A_i and A_k occur in the observation but A_j does not.

In this paper we study the problem of discovering fragments of orders of variables implicit in collections of unordered observations. We define measures that capture how well a given order agrees with the observed data. We describe a simple and efficient algorithm for finding all the fragments that satisfy certain conditions. We also discuss the postprocessing that is sometimes necessary for selecting only the best fragments of order. In addition, we relate our method to a sequencing approach that uses a spectral algorithm, and to the consecutive ones problem. We present experimental results on some real data sets (author lists of database papers, exam results data, and paleontological data).
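
The "no gaps" property described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's measure or algorithm, which the abstract does not specify. It only counts, for a candidate ordered fragment, the fraction of observations that contain two variables of the fragment while missing one that lies between them. The names `has_gap` and `violation_fraction` and the toy data are hypothetical.

```python
from typing import Hashable, Iterable, Sequence, Set


def has_gap(observation: Set[Hashable], fragment: Sequence[Hashable]) -> bool:
    """Return True if the observation contains A_i and A_k of the fragment
    but misses some A_j with i < j < k (illustrative check only)."""
    # Positions (within the candidate order) of the variables that are present.
    present = [pos for pos, var in enumerate(fragment) if var in observation]
    if len(present) < 2:
        return False  # zero or one variable present: no gap is possible
    # The present positions are contiguous exactly when the span they cover
    # equals their count; a larger span means some variable is missing.
    return present[-1] - present[0] + 1 > len(present)


def violation_fraction(
    observations: Iterable[Set[Hashable]], fragment: Sequence[Hashable]
) -> float:
    """Fraction of observations that violate the candidate fragment."""
    obs = list(observations)
    if not obs:
        return 0.0
    return sum(has_gap(o, fragment) for o in obs) / len(obs)


# Toy data (made up for illustration): each observation is an unordered set.
observations = [{"A", "B", "C"}, {"A", "C"}, {"B"}, {"B", "C", "D"}]

print(violation_fraction(observations, ["A", "B", "C", "D"]))
print(violation_fraction(observations, ["A", "D", "B", "C"]))
```

Under this illustrative count, a lower fraction means the candidate order agrees better with the data. A noise tolerance would then be a threshold on this fraction, chosen by the analyst.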