The generalized MDL approach for summarization

Authors:
Laks V. S. Lakshmanan; Raymond T. Ng; Christine Xing Wang; Xiaodong Zhou; Theodore J. Johnson
Affiliations:
Univ. of British Columbia; Univ. of British Columbia; Univ. of British Columbia; Univ. of British Columbia; AT&T Labs-Research
Venue:
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Year:
2002

Citing 12
Cited 15

Covering a simple orthogonal polygon with a minimum number of orthogonally convex polygons

SCG '87 Proceedings of the third annual symposium on Computational geometry
Inferring decision trees using the minimum description length principle

Information and Computation
MDL-Based Segmentation and Motion Modeling in a Long Image Sequence of Scene with Multiple Independently Moving Objects

IEEE Transactions on Pattern Analysis and Machine Intelligence
Automatic subspace clustering of high dimensional data for data mining applications

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Bottom-up computation of sparse and Iceberg CUBE

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Iceberg-cube computation with PC clusters

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
R-trees: a dynamic index structure for spatial searching

SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
On Optimal Node Splitting for R-trees

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
Computing Iceberg Queries Efficiently

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
The X-tree: An Index Structure for High-Dimensional Data

VLDB '96 Proceedings of the 22nd International Conference on Very Large Data Bases
MDL learning of unions of simple pattern languages from positive examples

EuroCOLT '95 Proceedings of the Second European Conference on Computational Learning Theory
Context models in the MDL framework

DCC '95 Proceedings of the Conference on Data Compression

Concise descriptions of subsets of structured sets

Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Diamond in the rough: finding Hierarchical Heavy Hitters in multi-dimensional data

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Framework and algorithms for trend analysis in massive temporal data sets

Proceedings of the thirteenth ACM international conference on Information and knowledge management
MDL summarization with holes

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Efficient and effective explanation of change in hierarchical summaries

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Compressing rectilinear pictures and minimizing access control lists

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Finding hierarchical heavy hitters in data streams

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Finding hierarchical heavy hitters in streaming data

ACM Transactions on Knowledge Discovery from Data (TKDD)
Graph summarization with bounded error

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Succinct summarization of transactional databases: an overlapped hyperrectangle scheme

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Towards Data Mining Without Information on Knowledge Structure

PKDD 2007 Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases
A Bipartite Graph Framework for Summarizing High-Dimensional Binary, Categorical and Numeric Data

SSDBM 2009 Proceedings of the 21st International Conference on Scientific and Statistical Database Management
Summarizing transactional databases with overlapped hyperrectangles

Data Mining and Knowledge Discovery
Handling inconsistencies in data warehouses

EDBT'04 Proceedings of the 2004 international conference on Current Trends in Database Technology
The class cover problem with boxes

Computational Geometry: Theory and Applications

Abstract

There are many applications in OLAP and data analysis where we identify regions of interest. For example, in OLAP, an analysis query involving aggregate sales performance of various products in different locations and seasons could help identify interesting cells, such as cells of a data cube having aggregate sales higher than a threshold. While a normal answer to such a query merely returns all interesting cells, it may be far more informative to the user if the system returns summaries or descriptions of regions formed from the identified cells. The Minimum Description Length (MDL) principle is a well-known strategy for finding such region descriptions. In this paper, we propose a generalization of the MDL principle, called GMDL, and show that GMDL leads to fewer regions than MDL, and hence more concise "answers" returned to the user. The key idea is that a region may contain "don't care" cells (up to a global maximum), if these "don't care" cells help to form bigger summary regions, leading to a more concise overall summary. We study the problem of generating minimal region descriptions under the GMDL principle for two different scenarios. In the first scenario, all dimensions of the data space are spatial. In the second scenario, all dimensions are categorical and organized in hierarchies. We propose region-finding algorithms for both scenarios and evaluate their run time and compression performance using detailed experimentation. Our results show the effectiveness of the GMDL principle and the proposed algorithms.
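The "don't care" idea in the abstract can be illustrated with a small Python sketch. This is not the paper's algorithm: it only checks whether a given set of axis-aligned regions is an acceptable summary under a don't-care budget. The names `cells_in`, `is_valid_cover`, and `max_dont_care` are hypothetical, and the rule that a region may not include any cell that is neither interesting nor "don't care" is an assumption made for this illustration, since the abstract does not spell it out.

```python
from itertools import product


def cells_in(region):
    """Return the grid cells of an axis-aligned region.

    A region is a pair (lo, hi) of integer coordinate tuples, both inclusive.
    """
    lo, hi = region
    return set(product(*(range(a, b + 1) for a, b in zip(lo, hi))))


def is_valid_cover(interesting, dont_care, regions, max_dont_care):
    """Check a candidate summary against a global don't-care budget.

    Assumption for this sketch: cells that are neither interesting nor
    don't-care must not appear in any region.
    """
    covered = set()
    for region in regions:
        covered |= cells_in(region)
    if not interesting <= covered:
        return False  # some interesting cell is left out
    extra = covered - interesting
    if not extra <= dont_care:
        return False  # a region includes a cell it may not include
    return len(extra) <= max_dont_care  # global cap on don't-care cells


# Toy one-row example: three interesting cells with a gap at column 2.
interesting = {(0, 0), (0, 1), (0, 3)}
dont_care = {(0, 2)}

two_regions = [((0, 0), (0, 1)), ((0, 3), (0, 3))]  # no don't-care cells used
one_region = [((0, 0), (0, 3))]                     # absorbs the gap cell

print(is_valid_cover(interesting, dont_care, two_regions, 0))
print(is_valid_cover(interesting, dont_care, one_region, 0))
print(is_valid_cover(interesting, dont_care, one_region, 1))
```

With a budget of zero the summary needs two regions, while allowing a single "don't care" cell lets one larger region describe the same interesting cells, which is the conciseness gain the abstract attributes to GMDL.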