TSum: fast, principled table summarization

Authors:
Jieying Chen;Jia-Yu Pan;Christos Faloutsos;Spiros Papadimitriou
Affiliations:
Google, Inc.;Google, Inc.;Carnegie Mellon University;Rutgers University
Venue:
Proceedings of the Seventh International Workshop on Data Mining for Online Advertising
Year:
2013

Citing 11
Cited 0

BIRCH: an efficient data clustering method for very large databases

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Efficiently supporting ad hoc queries in large datasets of time sequences

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
CURE: an efficient clustering algorithm for large databases

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
OPTICS: ordering points to identify the clustering structure

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Optimal partial-match retrieval when fields are independently specified

ACM Transactions on Database Systems (TODS)
Finding generalized projected clusters in high dimensional spaces

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications

Data Mining and Knowledge Discovery
Approximating a collection of frequent sets

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic Subspace Clustering of High Dimensional Data

Data Mining and Knowledge Discovery
Summarizing itemset patterns: a profile-based approach

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Discovering significant rules

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

Quantified Score

Hi-index	0.00

Visualization

Abstract

Given a table where rows correspond to records and columns correspond to attributes, we want to find a small number of patterns that succinctly summarize the dataset. For example, given a set of patient records with several attributes each, how can we find (a) that the "most representative" pattern is, say, (male, adult, *), followed by (*, child, low-cholesterol), etc? We propose TSum, a method that provides a sequence of patterns ordered by their "representativeness." It can decide both which these patterns are, as well as how many are necessary to properly summarize the data. Our main contribution is formulating a general framework, TSum, using compression principles. TSum can easily accommodate different optimization strategies for selecting and refining patterns. The discovered patterns can be used to both represent the data efficiently, as well as interpret it quickly. Extensive experiments demonstrate the effectiveness and intuitiveness of our discovered patterns.