A lower bound on the sample size needed to perform a significant frequent pattern mining task

Authors:
Stéphanie Jacquemont;François Jacquenet;Marc Sebban
Affiliations:
Laboratoire Hubert Curien, Université de Saint-Etienne, 18 rue du Professeur Lauras, 42000 Saint-Etienne, France;Laboratoire Hubert Curien, Université de Saint-Etienne, 18 rue du Professeur Lauras, 42000 Saint-Etienne, France;Laboratoire Hubert Curien, Université de Saint-Etienne, 18 rue du Professeur Lauras, 42000 Saint-Etienne, France
Venue:
Pattern Recognition Letters
Year:
2009

Abstract

In recent years, the problem of assessing the statistical significance of frequent patterns extracted from a given dataset S has received much attention. Because S is always a sample drawn from an unknown underlying distribution, two types of risk can arise during frequent pattern mining: accepting a false frequent pattern or rejecting a true one. Many approaches in the literature treat the dataset size as an application-dependent parameter. In that case, there is a trade-off between the two errors, leading to solutions that control only one risk at the expense of the other. Conversely, many sampling-based methods have attempted to determine the optimal size of S that ensures a good approximation of the original (potentially infinite) database from which S is drawn. However, these approaches often resort to Chernoff bounds, which do not allow the two risks to be controlled independently. In this paper, we overcome these drawbacks by providing a lower bound on the sample size required to control both risks and thus perform a significant frequent pattern mining task.
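
To make the limitation of Chernoff-style bounds concrete, here is a minimal sketch of the kind of sample-size calculation used by sampling-based methods. It is not the bound derived in this paper, which the abstract does not state. It assumes the standard two-sided Hoeffding inequality, P(|f_S - f| >= epsilon) <= 2 exp(-2 n epsilon^2), where f is the true frequency of a pattern and f_S its frequency in a sample of size n. The function name and the parameters `epsilon` and `delta` are illustrative choices, not notation from the paper.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest n such that 2 * exp(-2 * n * epsilon**2) <= delta.

    epsilon: tolerated absolute error on a pattern's frequency (assumed name).
    delta:   tolerated probability of exceeding that error (assumed name).
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    # Solve 2 * exp(-2 * n * epsilon^2) <= delta for n and round up.
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    n = hoeffding_sample_size(epsilon=0.01, delta=0.05)
    print(f"sample size for epsilon=0.01, delta=0.05: {n}")
```

Note that a single `delta` covers deviations in both directions at once, so the risk of accepting a false frequent pattern and the risk of rejecting a true one cannot be set separately in this formulation. That is the drawback the abstract attributes to Chernoff-based approaches.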