A Randomness Based Analysis on the Data Size Needed for Removing Deceptive Patterns

Authors:
Kazuya Haraguchi;Mutsunori Yagiura;Endre Boros;Toshihide Ibaraki
Venue:
IEICE - Transactions on Information and Systems
Year:
2008

Abstract

We consider a data set in which each example is an n-dimensional Boolean vector labeled as true or false. A pattern is a co-occurrence of a particular value combination of a given subset of the variables. If a pattern appears frequently in the true examples and infrequently in the false examples, we consider it a good pattern. In this paper, we discuss the problem of determining the data size needed for removing “deceptive” good patterns; in a data set of a small size, many good patterns may appear superficially, simply by chance, independently of the underlying structure. Our hypothesis is that, in order to remove such deceptive good patterns, the data set should contain a greater number of examples than that at which a random data set contains few good patterns. We justify this hypothesis by computational studies. We also derive a theoretical upper bound on the needed data size in view of our hypothesis.
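The notions in the abstract can be made concrete with a small sketch. The code below is illustrative only and is not the authors' procedure: the function names, the coverage thresholds (`min_true`, `max_false`), the cap on pattern size (`max_degree`), and the uniform random generation of examples and labels are all assumptions introduced here. It counts the patterns that cover many true examples and few false ones, and applies that count to random data sets of increasing size.

```python
import itertools
import random


def count_good_patterns(examples, labels, max_degree=2, min_true=0.5, max_false=0.1):
    """Count patterns that cover at least `min_true` of the true examples
    and at most `max_false` of the false examples.

    A pattern fixes the values of a subset of at most `max_degree` variables.
    The thresholds are assumed for illustration, not taken from the paper.
    """
    n = len(examples[0])
    true_ex = [x for x, y in zip(examples, labels) if y]
    false_ex = [x for x, y in zip(examples, labels) if not y]
    if not true_ex or not false_ex:
        return 0  # coverage rates are undefined without both classes

    good = 0
    for degree in range(1, max_degree + 1):
        for variables in itertools.combinations(range(n), degree):
            for values in itertools.product((0, 1), repeat=degree):

                def covers(x):
                    # The example matches the pattern on every fixed variable.
                    return all(x[i] == v for i, v in zip(variables, values))

                true_rate = sum(map(covers, true_ex)) / len(true_ex)
                false_rate = sum(map(covers, false_ex)) / len(false_ex)
                if true_rate >= min_true and false_rate <= max_false:
                    good += 1
    return good


def random_data_set(m, n, rng):
    """Draw m uniformly random n-dimensional Boolean examples with random labels."""
    examples = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(m)]
    labels = [rng.random() < 0.5 for _ in range(m)]
    return examples, labels


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    for m in (10, 20, 40, 80, 160):
        examples, labels = random_data_set(m, n=8, rng=rng)
        print(m, count_good_patterns(examples, labels))
```

Any good pattern found in such a data set is deceptive by construction, since the labels are independent of the examples. Under the hypothesis stated in the abstract, the data size at which this count becomes small would serve as a reference point for how many examples a real data set should contain.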