Significance and recovery of block structures in binary matrices with noise

Authors:
Xing Sun;Andrew Nobel
Affiliations:
Department of Statistics and Operations Research;Department of Statistics and Operations Research
Venue:
COLT'06 Proceedings of the 19th annual conference on Learning Theory
Year:
2006

Abstract

Frequent itemset mining (FIM) is one of the core problems in the field of Data Mining and occupies a central place in its literature. One equivalent form of FIM can be stated as follows: given a rectangular data matrix with binary entries, find every submatrix of 1s having a minimum number of columns. This paper presents a theoretical analysis of several statistical questions related to this problem when noise is present. We begin by establishing several results concerning the extremal behavior of submatrices of ones in a binary matrix with random entries. These results provide simple significance bounds for the output of FIM algorithms. We then consider the noise sensitivity of FIM algorithms under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. Thus such blocks cannot be directly recovered by FIM algorithms, which search for submatrices of all 1s. On the positive side, we show how, in the presence of noise, an error-tolerant criterion can recover a square submatrix of 1s against a background of 0s, even when the size of the target submatrix is very small.
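
To make the quantities in the abstract concrete, here is a minimal Python sketch under stated assumptions; it is not the authors' method. It assumes the "binary additive noise model" means each entry is XORed with an independent Bernoulli(p) bit, and that an "error-tolerant criterion" accepts a submatrix whose fraction of 1s is at least 1 - tau. The function names, the parameter tau, and the demo sizes are all hypothetical. The significance calculation is the standard first-moment (union) bound: in an n x n matrix of independent Bernoulli(p) entries, the expected number of k x k all-ones submatrices is C(n,k)^2 * p^(k^2), which upper-bounds the probability that any such submatrix exists.

```python
import math
import numpy as np


def log_expected_ones_blocks(n: int, k: int, p: float) -> float:
    """Natural log of E[# of k x k all-ones submatrices] in an n x n
    matrix of independent Bernoulli(p) entries: 2*log C(n,k) + k^2*log p.
    When this is very negative, a k x k block of 1s is unlikely under pure noise."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return 2.0 * log_choose + k * k * math.log(p)


def add_flip_noise(x: np.ndarray, p: float, rng: np.random.Generator) -> np.ndarray:
    """Assumed noise model: XOR every entry with an independent Bernoulli(p) bit."""
    flips = rng.random(x.shape) < p
    return np.logical_xor(x.astype(bool), flips).astype(np.uint8)


def is_all_ones(block: np.ndarray) -> bool:
    """Exact criterion used by standard FIM: every entry must be 1."""
    return bool(block.all())


def passes_error_tolerant(block: np.ndarray, tau: float) -> bool:
    """Hypothetical error-tolerant criterion: fraction of 1s is at least 1 - tau."""
    return float(block.mean()) >= 1.0 - tau


# Demo: plant a k x k block of 1s on a background of 0s, then add noise.
rng = np.random.default_rng(0)
n, k, p, tau = 200, 40, 0.1, 0.2
x = np.zeros((n, n), dtype=np.uint8)
x[:k, :k] = 1
y = add_flip_noise(x, p, rng)

planted = y[:k, :k]
exact_ok = is_all_ones(planted)                   # exact test on the noisy block
tolerant_ok = passes_error_tolerant(planted, tau)  # tolerant test on the same block
print(exact_ok, tolerant_ok)
print(log_expected_ones_blocks(1000, 30, 0.5))
```

With these settings the planted block has 1600 entries, so at noise level 0.1 it is essentially certain that at least one entry is flipped and the exact test fails, while the fraction of 1s stays near 0.9 and clears the 1 - tau threshold. This mirrors the abstract's contrast between exact and error-tolerant recovery, but only illustrates it for a block whose location is already known; the search problem itself is not addressed here.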