A Bipartite Graph Framework for Summarizing High-Dimensional Binary, Categorical and Numeric Data

  • Authors:
  • Guanhua Chen;Xiuli Ma;Dongqing Yang;Shiwei Tang;Meng Shuai

  • Affiliations:
  • School of Electronics Engineering and Computer Science, Peking University, Beijing, China 100871;School of Electronics Engineering and Computer Science, Peking University, Beijing, China 100871 and Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, China ...;School of Electronics Engineering and Computer Science, Peking University, Beijing, China 100871 and Key Laboratory of High Confidence Software Technologies (Ministry of Education), Peking Univers ...;School of Electronics Engineering and Computer Science, Peking University, Beijing, China 100871 and Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, China ...;Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, China 100871

  • Venue:
  • SSDBM 2009: Proceedings of the 21st International Conference on Scientific and Statistical Database Management
  • Year:
  • 2009

Quantified Score

Hi-index 0.02

Abstract

Data summarization is an important data mining task that aims to find a compact description of a dataset. Emerging applications place special requirements on data summarization techniques: the ability to find a concise and informative summary from high-dimensional data; the ability to handle different attribute types, such as binary, categorical and numeric attributes; end-user comprehensibility of the summary; insensitivity to noise and missing values; and scalability with data size and dimensionality. In this work, we propose a general framework for summarizing high-dimensional data that satisfies all of these requirements. We formulate the problem as a bipartite graph, mapping objects (data records) and attribute values to two disjoint sets of nodes, and discover a set of representative objects that serves as the summary of the original data. The representativeness of a candidate summary is measured using the minimum description length (MDL) principle, which helps yield a highly intuitive summary consisting of the most informative objects in the input data. Because finding the optimal summary with minimal representation cost is computationally infeasible, a heuristic algorithm computes an approximately optimal summary at a cost quadratic in the size of the data and linear in its dimensionality. In addition, several techniques are developed to improve both the quality of the resulting summary and the efficiency of the algorithm. A detailed study on both real and synthetic datasets shows the effectiveness and efficiency of our approach in summarizing high-dimensional datasets with binary, categorical and numeric attributes.
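
The abstract does not give the cost function or the heuristic, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes a toy description cost: a representative costs one unit per attribute value it carries, and any other object costs either its own values or one pointer unit plus the number of values by which it differs from its closest representative, whichever is smaller. The names `build_bipartite`, `description_cost` and `greedy_summary` are hypothetical, numeric attributes are assumed to be binned beforehand, and missing values (`None`) simply produce no edge.

```python
from collections import defaultdict


def build_bipartite(records):
    """Build both sides of the object / attribute-value bipartite graph.

    Returns (obj_to_vals, val_to_objs). A missing value (None) adds no edge.
    """
    obj_to_vals = {}
    val_to_objs = defaultdict(set)
    for obj_id, record in enumerate(records):
        nodes = {(attr, val) for attr, val in record.items() if val is not None}
        obj_to_vals[obj_id] = nodes
        for node in nodes:
            val_to_objs[node].add(obj_id)
    return obj_to_vals, dict(val_to_objs)


def description_cost(obj_to_vals, reps):
    """Toy MDL-style cost (an assumption, not the paper's formula)."""
    total = 0
    for obj_id, vals in obj_to_vals.items():
        if obj_id in reps:
            total += len(vals)  # representatives are stored in full
            continue
        best = len(vals)  # fallback: encode the object directly
        for rep in reps:
            # 1 unit for the pointer plus one unit per differing value node
            best = min(best, 1 + len(vals ^ obj_to_vals[rep]))
        total += best
    return total


def greedy_summary(obj_to_vals):
    """Add the representative that lowers the cost most until none helps."""
    reps = []
    best_cost = description_cost(obj_to_vals, reps)
    while True:
        choice = None
        for cand in sorted(obj_to_vals):
            if cand in reps:
                continue
            cost = description_cost(obj_to_vals, reps + [cand])
            if cost < best_cost:
                best_cost, choice = cost, cand
        if choice is None:
            return reps, best_cost
        reps.append(choice)


# Hypothetical categorical records; the last one has a missing value.
records = [
    {"color": "red", "shape": "round", "size": "S"},
    {"color": "red", "shape": "round", "size": "S"},
    {"color": "red", "shape": "round", "size": "M"},
    {"color": "blue", "shape": "square", "size": "L"},
    {"color": "blue", "shape": "square", "size": "L"},
    {"color": "blue", "shape": "square", "size": None},
]

obj_to_vals, val_to_objs = build_bipartite(records)
summary, cost = greedy_summary(obj_to_vals)
```

On this toy input the greedy loop selects one object from each of the two visible groups as the summary. The sketch makes no attempt to match the running time stated in the abstract.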