A Bipartite Graph Framework for Summarizing High-Dimensional Binary, Categorical and Numeric Data

Authors:
Guanhua Chen;Xiuli Ma;Dongqing Yang;Shiwei Tang;Meng Shuai
Affiliations:
School of Electronics Engineering and Computer Science, Peking University, Beijing, China 100871;School of Electronics Engineering and Computer Science, Peking University, Beijing, China 100871 and Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, China ...;School of Electronics Engineering and Computer Science, Peking University, Beijing, China 100871 and Key Laboratory of High Confidence Software Technologies (Ministry of Education), Peking Univers ...;School of Electronics Engineering and Computer Science, Peking University, Beijing, China 100871 and Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, China ...;Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, China 100871
Venue:
SSDBM 2009 Proceedings of the 21st International Conference on Scientific and Statistical Database Management
Year:
2009

Citing 14

Automatic subspace clustering of high dimensional data for data mining applications
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data

Multi-level organization and summarization of the discovered rules
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining

Advances in Automatic Text Summarization
Advances in Automatic Text Summarization

Mining Top-K Frequent Closed Patterns without Minimum Support
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining

Approximating a collection of frequent sets
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining

Mining and summarizing customer reviews
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining

On efficiently summarizing categorical databases
Knowledge and Information Systems

Turning Clusters into Patterns: Rectangle-Based Discriminative Data Description
ICDM '06 Proceedings of the Sixth International Conference on Data Mining

Summarization – compressing data into an informative representation
Knowledge and Information Systems

The generalized MDL approach for summarization
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases

Compressing large boolean matrices using reordering techniques
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

Graph summarization with bounded error
Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Efficient aggregation for graph summarization
Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Succinct summarization of transactional databases: an overlapped hyperrectangle scheme
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Cited 1

Towards bipartite graph data management
CloudDB '10 Proceedings of the second international workshop on Cloud data management

Quantified Score

Hi-index	0.02

Abstract

Data summarization is an important data mining task that aims to find a compact description of a dataset. Emerging applications place special requirements on data summarization techniques, including the ability to find a concise and informative summary of high-dimensional data, the ability to handle different attribute types (binary, categorical and numeric), end-user comprehensibility of the summary, insensitivity to noise and missing values, and scalability with data size and dimensionality. In this work, a general framework that satisfies all of these requirements is proposed for summarizing high-dimensional data. We formulate the problem in a bipartite graph scheme, mapping objects (data records) and attribute values to two disjoint sets of nodes of a graph, in which a set of representative objects is discovered as the summary of the original data. Further, representativeness is measured using the MDL (minimum description length) principle, which helps yield a highly intuitive summary consisting of the most informative objects of the input data. Because finding the optimal summary with minimal representation cost is computationally infeasible, an approximately optimal summary is obtained by a heuristic algorithm whose computational cost is quadratic in the size of the data and linear in its dimensionality. In addition, several techniques are developed to improve both the quality of the resulting summary and the efficiency of the algorithm. A detailed study on both real and synthetic datasets shows the effectiveness and efficiency of our approach in summarizing high-dimensional datasets with binary, categorical and numeric attributes.
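
The general idea in the abstract (objects linked to attribute-value nodes, with representatives chosen to lower a description cost) can be illustrated with a small sketch. This is not the authors' algorithm: the cost function below is a toy stand-in for an MDL-style measure, the greedy loop is a generic heuristic with no claim to the complexity stated above, and all names (`build_bipartite`, `description_cost`, `greedy_summary`) are hypothetical. It also assumes numeric attributes have already been binned into discrete values and that missing values are marked as `None`.

```python
from typing import Dict, FrozenSet, List, Optional, Sequence, Tuple

# A value node is an (attribute index, value) pair; an object node is a row index.
ValueNode = Tuple[int, object]
Edges = Dict[int, FrozenSet[ValueNode]]


def build_bipartite(records: Sequence[Sequence[Optional[object]]]) -> Edges:
    """Link each object to the value nodes it takes; None (missing) adds no edge."""
    return {
        i: frozenset((j, v) for j, v in enumerate(row) if v is not None)
        for i, row in enumerate(records)
    }


def description_cost(edges: Edges, reps: List[int]) -> int:
    """Toy cost (an assumption, not the paper's MDL formula).

    Each representative costs its own edge count. Every other object costs
    the cheaper of (a) listing its edges from scratch or (b) listing its
    differences from the closest representative.
    """
    model = sum(len(edges[r]) for r in reps)
    data = 0
    for obj, vals in edges.items():
        if obj in reps:
            continue
        options = [len(vals)] + [len(vals ^ edges[r]) for r in reps]
        data += min(options)
    return model + data


def greedy_summary(edges: Edges, max_reps: int) -> Tuple[List[int], int]:
    """Add the representative that lowers the cost most; stop when none helps."""
    reps: List[int] = []
    best = description_cost(edges, reps)
    while len(reps) < max_reps:
        candidates = [
            (description_cost(edges, reps + [o]), o) for o in edges if o not in reps
        ]
        if not candidates:
            break
        cost, obj = min(candidates)
        if cost >= best:
            break
        reps.append(obj)
        best = cost
    return reps, best


# Mixed categorical / binary toy records (made-up data for illustration only).
records = [
    ("red", "small", 1),
    ("red", "small", 1),
    ("red", "small", 0),
    ("blue", "large", 0),
    ("blue", "large", 0),
]
edges = build_bipartite(records)
reps, total = greedy_summary(edges, max_reps=3)
print("representatives:", reps, "cost:", total)
```

In this sketch the loop stops before reaching `max_reps` once an additional representative would cost more to store than it saves, which is the trade-off an MDL-style criterion is meant to capture.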