A novel hash-based approach for mining frequent itemsets over data streams requiring less memory space

Authors:
En Tzu Wang; Arbee L. Chen
Affiliations:
Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan, ROC; Department of Computer Science, National Chengchi University, Taipei, Taiwan, ROC
Venue:
Data Mining and Knowledge Discovery
Year:
2009

Abstract

In many recent applications, data are generated as continuous data streams. Because these streams must be handled as they arrive, and because the knowledge discovered in them can often yield substantial benefits, mining over data streams has become an important research issue. Many approaches for mining frequent itemsets over data streams have been proposed. They typically consist of two procedures: continuously maintaining a synopsis of the data stream, and finding frequent itemsets from that synopsis. However, most of these approaches assume that the synopsis fits in memory, and they ignore the fact that the information kept in the synopsis about non-frequent itemsets may significantly degrade memory utilization. In this paper, we consider compressing the information of all the itemsets into a fixed-size structure using a hash-based technique. This hash-based approach summarizes the whole data stream in a hash table, provides a novel technique for estimating the support counts of the non-frequent itemsets, and keeps only the frequent itemsets to speed up the mining process. In this way, memory space utilization is optimized. We present the correctness guarantee, error analysis, and parameter setting of this approach, and we perform a series of experiments to show its effectiveness and efficiency.
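
The idea of summarizing itemset information in a fixed-size hash table can be illustrated with a minimal Python sketch. It is not the authors' algorithm, whose details the abstract does not give; it is a generic table of hashed counters, written under stated assumptions. The class name `HashedSupportTable`, the parameters `num_buckets` and `max_itemset_size`, the use of `zlib.crc32` as the hash function, and the restriction to string items are all hypothetical choices made for illustration. Memory stays fixed regardless of how many distinct itemsets appear, and because colliding itemsets share a counter, each estimate can only overstate the true support count.

```python
import zlib
from itertools import combinations


class HashedSupportTable:
    """Fixed-size table of counters shared by all itemsets (hypothetical name)."""

    def __init__(self, num_buckets=1024, max_itemset_size=2):
        # Assumption: memory is bounded by a fixed number of integer counters.
        self.counts = [0] * num_buckets
        self.max_itemset_size = max_itemset_size
        self.num_transactions = 0

    def _bucket(self, itemset):
        # Sort the items so that the same itemset always maps to the same bucket.
        key = ",".join(sorted(itemset)).encode("utf-8")
        return zlib.crc32(key) % len(self.counts)

    def add_transaction(self, transaction):
        """Count every itemset of up to max_itemset_size items in one transaction."""
        self.num_transactions += 1
        items = sorted(set(transaction))
        for size in range(1, self.max_itemset_size + 1):
            for itemset in combinations(items, size):
                self.counts[self._bucket(itemset)] += 1

    def estimate_support(self, itemset):
        """Return an upper bound on the true count (collisions only add)."""
        return self.counts[self._bucket(itemset)]

    def may_be_frequent(self, itemset, min_support):
        """False means the itemset is certainly below the relative threshold."""
        return self.estimate_support(itemset) >= min_support * self.num_transactions


# Usage: feed a small stream of transactions, then query an itemset.
table = HashedSupportTable(num_buckets=64)
for transaction in [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]:
    table.add_transaction(transaction)

# {a, b} occurs in two transactions, so the estimate is at least 2.
print(table.estimate_support(("a", "b")))
```

In this sketch a larger `num_buckets` lowers the chance of collisions and so tightens the estimates, at the cost of more memory. How the paper chooses its parameters and bounds the error is covered by its own analysis and is not reproduced here.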