Automatic subspace clustering of high dimensional data for data mining applications
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Multi-level organization and summarization of the discovered rules
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Advances in Automatic Text Summarization
Advances in Automatic Text Summarization
Mining Top.K Frequent Closed Patterns without Minimum Support
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Approximating a collection of frequent sets
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Mining and summarizing customer reviews
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
On efficiently summarizing categorical databases
Knowledge and Information Systems
Turning Clusters into Patterns: Rectangle-Based Discriminative Data Description
ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Summarization – compressing data into an informative representation
Knowledge and Information Systems
The generalized MDL approach for summarization
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Compressing large boolean matrices using reordering techniques
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Graph summarization with bounded error
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Efficient aggregation for graph summarization
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Succinct summarization of transactional databases: an overlapped hyperrectangle scheme
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Towards bipartite graph data management
CloudDB '10 Proceedings of the second international workshop on Cloud data management
Hi-index | 0.02 |
Data summarization is an important data mining task which aims to find a compact description of a dataset. Emerging applications place special requirements to the data summarization techniques including the ability to find concise and informative summary from high dimensional data, the ability to deal with different types of attributes such as binary, categorical and numeric attributes, end-user comprehensibility of the summary, insensibility to noise and missing values and scalability with the data size and dimensionality. In this work, a general framework that satisfies all of these requirements is proposed to summarize high-dimensional data. We formulate this problem in a bipartite graph scheme, mapping objects (data records) and values of attributes into two disjoint groups of nodes of a graph, in which a set of representative objects is discovered as the summary of the original data. Further, the capability of representativeness is measured using the MDL principle, which helps to yield a highly intuitive summary with the most informative objects of the input data. While the problem of finding the optimal summary with minimal representation cost is computationally infeasible, an approximate optimal summary is achieved by a heuristic algorithm whose computation cost is quadratic to the size of data and linear to the dimensionality of data. In addition, several techniques are developed to improve both quality of the resultant summary and efficiency of the algorithm. A detailed study on both real and synthetic datasets shows the effectiveness and efficiency of our approach in summarizing high-dimensional datasets with binary, categorical and numeric attributes.