Solving inverse frequent itemset mining with infrequency constraints via large-scale linear programs

Authors and affiliations:
Antonella Guzzo, University of Calabria, Rende (CS), Italy
Luigi Moccia, Consiglio Nazionale delle Ricerche - Istituto di Calcolo e Reti ad Alte Prestazioni, Rende (CS), Italy
Domenico Saccà, University of Calabria, Rende (CS), Italy
Edoardo Serra, University of Calabria, Rende (CS), Italy
Venue:
ACM Transactions on Knowledge Discovery from Data (TKDD)
Year:
2013

Abstract

Inverse frequent set mining (IFM) is the problem of computing a transaction database D that satisfies given support constraints for some itemsets, typically the frequent ones. This article proposes a new formulation of IFM, called IFMI (IFM with infrequency constraints), in which the itemsets not listed as frequent are constrained to be infrequent; that is, their support must be less than or equal to a single specified threshold. An instance of IFMI can be seen as an instance of the original IFM by making the infrequency constraints explicit for the minimal infrequent itemsets, which correspond to the so-called negative generator border defined in the literature. The increase in complexity from PSPACE (the complexity of IFM) to NEXP (the complexity of IFMI) is caused by the cardinality of the negative generator border, which can be exponential in the original input size. The article therefore introduces a problem parameter κ that gives an upper bound on this cardinality, using a hypergraph interpretation in which minimal infrequent itemsets correspond to minimal transversals. By fixing a constant k, the article formulates a k-bounded version of the problem, called k-IFMI, that collects all instances for which the value of κ is less than or equal to k; its complexity is in PSPACE, as for IFM. The bounded problem is encoded as an integer linear program with a large number of variables (exponential in the number of constraints), which is then approximated by relaxing the integrality constraints; the decision problem associated with solving the resulting linear program is proven to be in NP. To solve the linear program, the article uses column generation, a variation of the simplex method designed for large-scale linear programs, in particular those with a huge number of variables.
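The integer program described above can be sketched as follows. This is one plausible reading of the abstract, not the article's own formulation, and all notation is assumed for illustration: $\mathcal{I}$ is the set of items, $\mathcal{S}$ the listed itemsets with support bounds $[\sigma^{S}_{\min}, \sigma^{S}_{\max}]$, $\mathcal{B}$ the negative generator border, $\sigma'$ the infrequency threshold, and $x_T$ the number of copies of transaction $T$ in D.

```latex
% Sketch under assumed notation (requires amsmath and amssymb).
% One variable x_T per possible transaction T, hence exponentially many columns.
\begin{align*}
\text{find }  & x = (x_T)_{T \subseteq \mathcal{I}} \\
\text{s.t. }  & \sigma^{S}_{\min} \le \sum_{T \supseteq S} x_T \le \sigma^{S}_{\max}
                && \forall S \in \mathcal{S}
                && \text{(support constraints)} \\
              & \sum_{T \supseteq B} x_T \le \sigma'
                && \forall B \in \mathcal{B}
                && \text{(infrequency constraints)} \\
              & x_T \in \mathbb{Z}_{\ge 0}
                && \forall T \subseteq \mathcal{I}
\end{align*}
% Linear relaxation: replace the last line by x_T \ge 0 with x_T real-valued.
```

Under this reading, each column has a 1 in the row of every constrained itemset contained in $T$, which is why a method that generates columns on demand is a natural fit.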
At each step, column generation requires the solution of an auxiliary integer linear program, which is proven to be NP-hard in this case and for which a greedy heuristic is presented. The resulting column generation algorithm scales very well, as shown by extensive experiments, paving the way for its application in real-life scenarios.
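To make the problem statement concrete, the following minimal Python sketch checks whether a small database satisfies IFMI-style constraints. It is a brute-force illustration, not the article's algorithm. The function names, the `(lo, hi)` support intervals, and the reading of "not listed" as "not a subset of any listed itemset" are assumptions made for this example.

```python
from itertools import combinations


def support(db, itemset):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)


def negative_border(items, listed):
    """Minimal itemsets that are not a subset of any listed itemset.

    Brute force over all non-empty subsets by increasing size, so it is
    exponential in len(items) and only suitable for tiny examples.
    """
    border = []
    for size in range(1, len(items) + 1):
        for combo in combinations(sorted(items), size):
            cand = frozenset(combo)
            covered = any(cand <= s for s in listed)
            # If a smaller border itemset is inside cand, cand is not minimal.
            has_smaller = any(b <= cand for b in border)
            if not covered and not has_smaller:
                border.append(cand)
    return border


def satisfies_ifmi(db, items, bounds, threshold):
    """Check support intervals for listed itemsets and the infrequency threshold.

    Support never increases when an itemset grows, so checking the minimal
    unlisted itemsets (the border) covers all of their supersets as well.
    """
    listed_ok = all(lo <= support(db, s) <= hi for s, (lo, hi) in bounds.items())
    border_ok = all(
        support(db, b) <= threshold for b in negative_border(items, bounds.keys())
    )
    return listed_ok and border_ok


# Hypothetical toy instance: items a..d, two listed itemsets with support intervals.
items = set("abcd")
bounds = {frozenset("ab"): (2, 3), frozenset("c"): (1, 2)}
db_ok = [frozenset("ab"), frozenset("ab"), frozenset("c"), frozenset("abc")]
db_bad = [frozenset("ab"), frozenset("abc"), frozenset("abc")]

print(negative_border(items, bounds))
print(satisfies_ifmi(db_ok, items, bounds, threshold=1))
print(satisfies_ifmi(db_bad, items, bounds, threshold=1))
```

In this toy instance the border consists of {d}, {a, c} and {b, c}. An itemset fails to be a subset of any listed itemset exactly when it intersects the complement of every listed itemset, which is one standard way to see the connection to minimal hypergraph transversals mentioned in the abstract.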