A performance study of three disk-based structures for indexing and querying frequent itemsets

Authors: Guimei Liu, Andre Suchitra, Limsoon Wong
Affiliation: School of Computing, National University of Singapore (all authors)
Venue: Proceedings of the VLDB Endowment
Year: 2013

Abstract

Frequent itemset mining is an important problem in data mining, and extensive effort has been devoted to developing efficient mining algorithms. However, little attention has been paid to managing the large collections of frequent itemsets that these algorithms produce, whether for subsequent analysis or for user exploration. In this paper, we study three structures for indexing and querying frequent itemsets: inverted files, signature files, and the CFP-tree. The first two have been widely used for indexing general set-valued data; we modify them to make them more suitable for indexing frequent itemsets. The CFP-tree is designed specifically for storing frequent itemsets; we add a pruning technique based on length-2 frequent itemsets to make it more efficient at processing superset queries. We study the performance of the three structures in supporting five types of containment queries: exact match, subset search, superset search, immediate subset search, and immediate superset search. Our results show that no single structure outperforms the others on all five query types across all datasets, but the CFP-tree shows better overall performance than the other two structures.
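
To make the five containment queries concrete, the following is a minimal in-memory sketch of an inverted file over a small collection of itemsets. It is not the paper's implementation: the paper studies disk-based structures with modifications that the abstract does not detail. The class name `InvertedItemsetIndex` and its method names are hypothetical. The abstract also does not define "immediate" subsets and supersets, so the sketch assumes they are stored itemsets that differ from the query by exactly one item.

```python
from collections import defaultdict


class InvertedItemsetIndex:
    """Toy inverted file: maps each item to the ids of itemsets containing it.

    Hypothetical illustration only; assumes stored itemsets are non-empty.
    """

    def __init__(self, itemsets):
        self.itemsets = [frozenset(s) for s in itemsets]
        self.postings = defaultdict(set)
        for sid, itemset in enumerate(self.itemsets):
            for item in itemset:
                self.postings[item].add(sid)

    def supersets(self, query):
        """Stored itemsets that contain every item of the query."""
        q = frozenset(query)
        if not q:
            return set(self.itemsets)
        # Intersect the posting lists of all query items.
        lists = [self.postings.get(item, set()) for item in q]
        return {self.itemsets[i] for i in set.intersection(*lists)}

    def subsets(self, query):
        """Stored itemsets whose items all occur in the query."""
        hits = defaultdict(int)
        for item in frozenset(query):
            for sid in self.postings.get(item, ()):
                hits[sid] += 1
        # An itemset is a subset iff every one of its items was hit.
        return {self.itemsets[i] for i, n in hits.items()
                if n == len(self.itemsets[i])}

    def exact(self, query):
        """True if the query itself is a stored itemset."""
        q = frozenset(query)
        return q in self.supersets(q)

    def immediate_supersets(self, query):
        """Assumed definition: supersets with exactly one extra item."""
        q = frozenset(query)
        return {s for s in self.supersets(q) if len(s) == len(q) + 1}

    def immediate_subsets(self, query):
        """Assumed definition: subsets with exactly one item fewer."""
        q = frozenset(query)
        return {s for s in self.subsets(q) if len(s) == len(q) - 1}


index = InvertedItemsetIndex(
    [{"a"}, {"b"}, {"a", "b"}, {"a", "b", "c"}, {"b", "c"}]
)
print(sorted(sorted(s) for s in index.supersets({"a", "b"})))
print(sorted(sorted(s) for s in index.immediate_subsets({"a", "b", "c"})))
```

In this sketch, superset search intersects posting lists, while subset search counts how many items of each stored itemset appear in the query. Both searches return the query itself when it is stored. A signature file or CFP-tree would answer the same five queries with different storage and pruning trade-offs.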