Identifying the components

  • Authors:
  • Matthijs van Leeuwen; Jilles Vreeken; Arno Siebes

  • Affiliations:
  • Department of Computer Science, Universiteit Utrecht, Utrecht, The Netherlands (all authors)

  • Venue:
  • Data Mining and Knowledge Discovery
  • Year:
  • 2009

Abstract

Most, if not all, databases are mixtures of samples from different distributions, and transactional data is no exception. In the prototypical example, supermarket basket analysis, one expects a mixture of different buying patterns: households of retired people buy different collections of items than households with young children. Models that take such underlying distributions into account are in general superior to those that do not. In this paper we introduce two MDL-based algorithms that follow orthogonal approaches to identifying the components in a transaction database. The first follows a model-based approach, while the second is data-driven. Both are parameter-free: the number of components and the components themselves are chosen such that the combined complexity of data and models is minimised. Further, neither prior knowledge about the distributions nor a distance metric on the data is required. Experiments with both methods show that highly characteristic components are identified.
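
To make the selection criterion concrete, the following is a minimal sketch of choosing a partition by total description length. It is not the paper's algorithm: the encoding is a toy one that assumes items occur independently within a component, and the names `component_bits` and `total_bits`, the model-cost formula, and the example baskets are all illustrative assumptions.

```python
import math
from collections import Counter


def component_bits(transactions, alphabet):
    """Two-part code length (in bits) of one component under a toy model."""
    counts = Counter(item for t in transactions for item in t)
    total = sum(counts.values())
    # Laplace-smoothed item probabilities, so every item has a finite code length.
    denom = total + len(alphabet)
    # Data cost: each item occurrence costs -log2(p) bits (Shannon code length).
    data_bits = sum(-counts[i] * math.log2((counts[i] + 1) / denom) for i in alphabet)
    # Model cost (assumed): one count per item, each in the range 0..total.
    model_bits = len(alphabet) * math.log2(total + 1)
    return data_bits + model_bits


def total_bits(partition, alphabet):
    """Total description length of a database split into components."""
    n = sum(len(part) for part in partition)
    # Each transaction also needs a label naming its component.
    label_bits = n * math.log2(len(partition))
    return label_bits + sum(component_bits(part, alphabet) for part in partition)


# Hypothetical baskets for two household types.
retired = [{"tea", "biscuits", "newspaper"}] * 20
young = [{"diapers", "milk", "baby food"}] * 20
db = retired + young
alphabet = sorted(set().union(*db))

one = total_bits([db], alphabet)                       # a single component
by_household = total_bits([retired, young], alphabet)  # split along the true mixture
arbitrary = total_bits([db[::2], db[1::2]], alphabet)  # split that ignores the mixture

print(f"one: {one:.1f}  by household: {by_household:.1f}  arbitrary: {arbitrary:.1f}")
```

On this toy data the split along household type gives the shortest description, while an arbitrary split costs more than no split at all because the extra model and label bits buy no compression. This is the sense in which both the number of components and their contents can be chosen without a user-set parameter.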