Many approaches have been proposed to find correlations in binary data. Usually, these methods focus on pairwise correlations. In biological applications, however, it is important to find correlations that involve more than two features. Moreover, a set of strongly correlated features should be non-redundant, in the sense that the correlation is strong only when all the interacting features are considered together: removing any feature greatly reduces the correlation. In this paper, we explore the problem of finding non-redundant high-order correlations in binary data. The high-order correlations are formalized using multi-information, a generalization of pairwise mutual information. To reduce redundancy, we require every proper subset of a strongly correlated feature subset to be only weakly correlated. Such feature subsets are referred to as Non-redundant Interacting Feature Subsets (NIFSs). Finding all NIFSs is computationally challenging, because in addition to enumerating feature combinations, we also need to check all their subsets for redundancy. We study several properties of NIFSs and show that these properties are useful in developing efficient algorithms. We further derive two sets of upper and lower bounds on the correlations, which can be incorporated into the algorithm to prune the search space. A simple and effective pruning strategy based on pairwise mutual information is also developed to further prune the search space. The efficiency and effectiveness of our approach are demonstrated through extensive experiments on synthetic and real-life datasets.
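To make the definitions concrete, the following is a minimal sketch (not the paper's pruning algorithm) of multi-information and a brute-force NIFS check. Multi-information of a feature set S is C(S) = Σ_i H(X_i) − H(S), which reduces to mutual information for |S| = 2 and is zero exactly when the features are independent. The thresholds `strong` and `weak` and the function names are illustrative assumptions, not from the paper.

```python
import numpy as np
from itertools import combinations

def entropy(columns):
    """Shannon entropy (in bits) of the joint distribution of binary columns."""
    rows = np.asarray(columns).T          # shape (n_samples, n_features)
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def multi_information(data, subset):
    """Total correlation C(S) = sum_i H(X_i) - H(S) for columns in `subset`."""
    cols = [data[:, i] for i in subset]
    return sum(entropy([c]) for c in cols) - entropy(cols)

def is_nifs(data, subset, strong=0.5, weak=0.1):
    """Brute-force check: `subset` is strongly correlated while every proper
    sub-subset of size >= 2 is only weakly correlated. Thresholds are
    illustrative; the paper's algorithm avoids this exhaustive subset scan
    via bounds-based pruning."""
    if multi_information(data, subset) < strong:
        return False
    for k in range(2, len(subset)):
        for sub in combinations(subset, k):
            if multi_information(data, sub) >= weak:
                return False
    return True
```

For example, three binary features where the third is the XOR of the first two form a NIFS under these thresholds: the triple has multi-information of 1 bit, yet every pair is independent (pairwise mutual information 0), which is exactly the kind of high-order, non-redundant correlation that pairwise methods miss.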