Finding Representative Set from Massive Data

Authors:
Feng Pan; Wei Wang; Anthony K. H. Tung; Jiong Yang
Affiliations:
University of North Carolina at Chapel Hill; University of North Carolina at Chapel Hill; National University of Singapore; Case Western Reserve University
Venue:
ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Year:
2005

Citing (7):
Elements of information theory
Document clustering using word clusters via the information bottleneck method. SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
On clusterings-good, bad and spectral. FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
A divisive information theoretic feature clustering algorithm for text classification. The Journal of Machine Learning Research
A probabilistic framework for semi-supervised clustering. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
A graph-theoretic approach to extract storylines from search results. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
SUMMARY: Efficiently Summarizing Transactions for Clustering. ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining

Cited by (6):
An Improved Algorithm for Mining Non-Redundant Interacting Feature Subsets. APWeb/WAIM '09 Proceedings of the Joint International Conferences on Advances in Data and Web Management
Using trees to depict a forest. Proceedings of the VLDB Endowment
Splash: ad-hoc querying of data and statistical models. Proceedings of the 13th International Conference on Extending Database Technology
A model for mining relevant and non-redundant information. Proceedings of the 27th Annual ACM Symposium on Applied Computing
Finding representative nodes in probabilistic graphs. Bisociative Knowledge Discovery
Measuring the coverage and redundancy of information search services on e-commerce platforms. Electronic Commerce Research and Applications

Abstract

In the information age, data is pervasive, and in some applications data explosion is a significant phenomenon. The massive data volume poses challenges to both human users and computers. In this project, we propose a new model for identifying a representative set from a large database. A representative set is a special subset of the original dataset with three main characteristics: (1) it is significantly smaller than the original dataset; (2) it captures more information from the original dataset than other subsets of the same size; (3) it has low redundancy among the representatives it contains. We use information-theoretic measures such as mutual information and relative entropy to measure the representativeness of a representative set. We first design a greedy algorithm and then present a heuristic algorithm that delivers much better performance. We run experiments on two real datasets and evaluate the effectiveness of our representative set in terms of coverage and accuracy. The experiments show that our representative set attains the expected characteristics and captures information more efficiently.
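The abstract does not spell out the algorithms, so the following is only a minimal sketch of how a greedy, relative-entropy-driven selection might look, not the authors' method. It assumes each data item is represented as a discrete probability distribution over the same features, and it uses a generic greedy rule: repeatedly add the item that most reduces the total relative entropy from every item to its closest representative. The function names (`relative_entropy`, `total_cost`, `greedy_representatives`), the smoothing constant `eps`, and the toy data are all hypothetical.

```python
import math


def relative_entropy(p, q, eps=1e-9):
    """Relative entropy D(p || q) in bits.

    eps is an assumed smoothing constant that keeps the value finite
    when q has zero entries where p does not.
    """
    return sum(pi * math.log2(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def total_cost(items, reps):
    """Sum over all items of the divergence to the closest chosen representative."""
    return sum(min(relative_entropy(x, items[r]) for r in reps) for x in items)


def greedy_representatives(items, k):
    """Greedily pick up to k item indices, each step minimizing total_cost."""
    chosen = []
    for _ in range(min(k, len(items))):
        best = min(
            (i for i in range(len(items)) if i not in chosen),
            key=lambda i: total_cost(items, chosen + [i]),
        )
        chosen.append(best)
    return chosen


# Toy data (hypothetical): two groups of similar distributions.
items = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.2, 0.8],
]
reps = greedy_representatives(items, 2)
print(reps)
```

On this toy input the two selected indices come from different groups, which illustrates the low-redundancy property: a second representative from the same group would reduce the total cost far less than one from the uncovered group. Each greedy step here re-evaluates the cost over all items, so the sketch favors clarity over efficiency.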