The generalized MDL approach for summarization

  • Authors:
  • Laks V. S. Lakshmanan;Raymond T. Ng;Christine Xing Wang;Xiaodong Zhou;Theodore J. Johnson

  • Affiliations:
  • Univ. of British Columbia;Univ. of British Columbia;Univ. of British Columbia;Univ. of British Columbia;AT&T Labs-Research

  • Venue:
  • VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
  • Year:
  • 2002

Abstract

There are many applications in OLAP and data analysis where we identify regions of interest. For example, in OLAP, an analysis query involving the aggregate sales performance of various products in different locations and seasons could help identify interesting cells, such as cells of a data cube whose aggregate sales exceed a threshold. While a normal answer to such a query merely returns all interesting cells, it may be far more informative to the user if the system returns summaries or descriptions of regions formed from the identified cells. The Minimum Description Length (MDL) principle is a well-known strategy for finding such region descriptions. In this paper, we propose a generalization of the MDL principle, called GMDL, and show that GMDL leads to fewer regions than MDL, and hence more concise "answers" returned to the user. The key idea is that a region may contain "don't care" cells (up to a global maximum), if these "don't care" cells help to form bigger summary regions, leading to a more concise overall summary. We study the problem of generating minimal region descriptions under the GMDL principle for two different scenarios. In the first, all dimensions of the data space are spatial. In the second, all dimensions are categorical and organized in hierarchies. We propose region-finding algorithms for both scenarios and evaluate their run time and compression performance using detailed experimentation. Our results show the effectiveness of the GMDL principle and the proposed algorithms.
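
The key idea can be illustrated with a small brute-force sketch. This is not the paper's algorithm. It is a toy under stated assumptions: a two-dimensional spatial grid, regions that are axis-aligned rectangles, and every non-interesting cell treated as a potential "don't care" cell that counts against the global maximum (`budget`). The names `rectangles`, `min_regions`, and `budget` are hypothetical and chosen only for this illustration.

```python
from itertools import combinations


def rectangles(n_rows, n_cols):
    """Yield every axis-aligned rectangle as a frozenset of (row, col) cells."""
    for r1 in range(n_rows):
        for r2 in range(r1, n_rows):
            for c1 in range(n_cols):
                for c2 in range(c1, n_cols):
                    yield frozenset(
                        (r, c)
                        for r in range(r1, r2 + 1)
                        for c in range(c1, c2 + 1)
                    )


def min_regions(n_rows, n_cols, interesting, budget):
    """Smallest number of rectangles that cover all interesting cells while
    covering at most `budget` other cells in total (exhaustive search).

    Assumption for this sketch: any non-interesting cell may be absorbed as
    a "don't care" cell, and the budget is a single global count.
    """
    interesting = frozenset(interesting)
    if not interesting:
        return 0
    # Only rectangles touching at least one interesting cell can be useful.
    candidates = [r for r in rectangles(n_rows, n_cols) if r & interesting]
    # Covering each interesting cell by its own 1x1 rectangle always works,
    # so the search is guaranteed to succeed by k == len(interesting).
    for k in range(1, len(interesting) + 1):
        for combo in combinations(candidates, k):
            covered = frozenset().union(*combo)
            extra = covered - interesting
            if interesting <= covered and len(extra) <= budget:
                return k
    return None  # unreachable for non-empty input


# A 2x3 grid in which only the centre cell of the second row is not interesting.
cells = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 2)}
print(min_regions(2, 3, cells, budget=0))  # 3
print(min_regions(2, 3, cells, budget=1))  # 1
```

With a budget of zero the search behaves like a plain exact cover and needs three rectangles; allowing a single "don't care" cell lets one rectangle summarize the whole grid. The exhaustive search is exponential and only practical for tiny grids.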