Mining non-redundant high order correlations in binary data

Authors:
Xiang Zhang;Feng Pan;Wei Wang;Andrew Nobel
Affiliations:
University of North Carolina at Chapel Hill;University of North Carolina at Chapel Hill;University of North Carolina at Chapel Hill;University of North Carolina at Chapel Hill
Venue:
Proceedings of the VLDB Endowment
Year:
2008

Abstract

Many approaches have been proposed to find correlations in binary data. Usually, these methods focus on pairwise correlations. In biological applications, however, it is important to find correlations that involve more than two features. Moreover, a set of strongly correlated features should be non-redundant, in the sense that the correlation is strong only when all the interacting features are considered together: removing any feature greatly reduces the correlation. In this paper, we explore the problem of finding non-redundant high order correlations in binary data. The high order correlations are formalized using multi-information, a generalization of pairwise mutual information. To reduce redundancy, we require every proper subset of a strongly correlated feature subset to be weakly correlated. Such feature subsets are referred to as Non-redundant Interacting Feature Subsets (NIFS). Finding all NIFSs is computationally challenging because, in addition to enumerating feature combinations, we also need to check all their subsets for redundancy. We study several properties of NIFSs and show that these properties are useful in developing efficient algorithms. We further develop two sets of upper and lower bounds on the correlations, which can be incorporated into the algorithm to prune the search space. A simple and effective strategy based on pairwise mutual information is also developed to further prune the search space. The efficiency and effectiveness of our approach are demonstrated through extensive experiments on synthetic and real-life datasets.
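
To make the definition above concrete, the following is a minimal brute-force sketch, not the pruned algorithm the paper develops. It assumes multi-information is computed in its standard form (the sum of the marginal entropies minus the joint entropy) from empirical frequencies, and that "strongly" and "weakly" correlated are decided by two user-chosen thresholds. The function names and the `strong` and `weak` parameters are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(rows, cols):
    """Empirical joint entropy (in bits) of the columns `cols` over `rows`."""
    n = len(rows)
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())


def multi_information(rows, cols):
    """Sum of the marginal entropies minus the joint entropy of `cols`."""
    return sum(entropy(rows, (c,)) for c in cols) - entropy(rows, cols)


def is_nifs(rows, cols, strong, weak):
    """Brute-force check (illustrative thresholds, not the paper's notation):
    `cols` must reach `strong`, and every proper subset with at least
    two features must stay at or below `weak`."""
    if multi_information(rows, cols) < strong:
        return False
    return all(
        multi_information(rows, sub) <= weak
        for size in range(2, len(cols))
        for sub in combinations(cols, size)
    )


# Toy data: the third feature is the XOR of the first two, so no pair is
# correlated but the triple is.
xor_rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

# Toy data: the third feature copies the first, so the pair (0, 2) alone
# already carries the correlation and the triple is redundant.
copy_rows = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]

print(is_nifs(xor_rows, (0, 1, 2), strong=0.9, weak=0.1))
print(is_nifs(copy_rows, (0, 1, 2), strong=0.9, weak=0.1))
```

On the XOR data every pair of features is independent while the three together are fully dependent, so the triple passes the check. On the copy data the pair (0, 2) is already strongly correlated, so the triple is rejected as redundant. The exhaustive subset enumeration in `is_nifs` is exactly the cost that the bounds and pruning strategies mentioned in the abstract are meant to avoid.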