Finding frequent items in probabilistic data

Authors:
Qin Zhang;Feifei Li;Ke Yi
Affiliations:
Hong Kong University of Science and Technology, Hong Kong, Hong Kong;Florida State University, Tallahassee, USA;Hong Kong University of Science and Technology, Hong Kong, Hong Kong
Venue:
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Year:
2008

Abstract

Computing statistical information on probabilistic data has attracted considerable attention recently, because data generated by a wide range of sources is inherently fuzzy or uncertain. In this paper, we study an important statistical query on probabilistic data: finding the frequent items. One straightforward approach to identifying the frequent items in a probabilistic data set is to compute the expected frequency of an item and check whether it exceeds a certain fraction of the expected size of the whole data set. However, this simple definition misses important information about the internal structure of the probabilistic data and the interplay among all the uncertain entities. We therefore propose a new definition based on the possible world semantics, which has been widely adopted for many query types in uncertain data management: find all the items that are likely to be frequent in a randomly generated possible world. Our approach also leads naturally to the study of ranking frequent items by confidence. Finding likely frequent items in probabilistic data turns out to be much more difficult than computing expected frequencies. We first propose exact algorithms for offline data that run in either quadratic or cubic time. Next, we design novel sampling-based algorithms for streaming data that find all approximately likely frequent items with theoretically guaranteed accuracy and high probability. Our sampling schemes consume sublinear memory and exhibit excellent scalability. Finally, we verify the effectiveness and efficiency of our algorithms with extensive experimental evaluations on both real and synthetic data sets.
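
The two definitions contrasted in the abstract can be illustrated with a minimal Python sketch. It is not one of the paper's algorithms: it assumes a simplified data model in which each tuple is an (item, probability) pair that appears independently of the others, and it estimates the possible-world confidence by naive Monte Carlo sampling rather than by the exact or streaming methods the abstract mentions. The function names and the parameters `phi` (frequency fraction), `tau` (confidence threshold), and `worlds` (number of sampled worlds) are hypothetical.

```python
import random
from collections import Counter


def expected_frequent(tuples, phi):
    """Expected-frequency definition: keep items whose expected count
    exceeds phi times the expected size of the data set."""
    exp_count = Counter()
    for item, p in tuples:
        exp_count[item] += p
    exp_size = sum(p for _, p in tuples)
    return {item for item, c in exp_count.items() if c > phi * exp_size}


def likely_frequent(tuples, phi, tau, worlds=2000, seed=0):
    """Possible-world definition, estimated by sampling (an assumption,
    not the paper's method): keep items that are frequent in more than
    a tau fraction of the sampled worlds, with their confidence."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(worlds):
        # Draw one possible world: each tuple appears independently.
        world = Counter(item for item, p in tuples if rng.random() < p)
        size = sum(world.values())
        for item, c in world.items():
            if c > phi * size:
                hits[item] += 1
    confidence = {item: h / worlds for item, h in hits.items()}
    return {item: c for item, c in confidence.items() if c > tau}


if __name__ == "__main__":
    data = [("a", 0.9), ("a", 0.8), ("b", 0.5), ("c", 0.3), ("c", 0.3)]
    print(expected_frequent(data, phi=0.5))
    print(likely_frequent(data, phi=0.5, tau=0.6))
```

The confidence values returned by `likely_frequent` also give a ranking of items, in the spirit of the confidence-based ranking the abstract refers to.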