Computing statistical information on probabilistic data has attracted considerable attention recently, because the data generated by a wide range of sources are inherently fuzzy or uncertain. In this paper, we study an important statistical query on probabilistic data: finding the frequent items. One straightforward approach to identifying the frequent items in a probabilistic data set is to compute the expected frequency of an item and check whether it exceeds a certain fraction of the expected size of the whole data set. However, this simple definition misses important information about the internal structure of the probabilistic data and the interplay among the uncertain entities. We therefore propose a new definition based on the possible world semantics, which has been widely adopted for many query types in uncertain data management: find all the items that are likely to be frequent in a randomly generated possible world. Our approach naturally leads to the study of ranking frequent items by confidence as well. Finding likely frequent items in probabilistic data turns out to be much more difficult than computing expected frequencies. We first propose exact algorithms for offline data that run in either quadratic or cubic time. Next, we design novel sampling-based algorithms for streaming data that find all approximately likely frequent items with theoretically guaranteed high probability and accuracy. Our sampling schemes consume sublinear memory and exhibit excellent scalability. Finally, we verify the effectiveness and efficiency of our algorithms through extensive experimental evaluations on both real and synthetic data sets.
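To make the two definitions concrete, the following Python sketch contrasts them on a toy data set. It rests on assumptions that the abstract does not state: each tuple carries a fixed item and exists independently with a given probability, an item counts as frequent in a world when its count strictly exceeds `phi` times the world size, and the function names `expected_frequency_ratio` and `frequent_probability` are hypothetical. The exhaustive enumeration of possible worlds takes exponential time and is meant only to illustrate the semantics; it is not one of the quadratic, cubic, or sampling-based algorithms mentioned above.

```python
from collections import Counter, defaultdict
from itertools import product


def expected_frequency_ratio(tuples):
    """Return {item: E[count of item] / E[world size]}.

    `tuples` is a list of (item, existence_probability) pairs
    (assumed model: tuples exist independently of each other).
    """
    expected_size = sum(p for _, p in tuples)
    expected_count = defaultdict(float)
    for item, p in tuples:
        expected_count[item] += p
    return {item: c / expected_size for item, c in expected_count.items()}


def frequent_probability(tuples, phi):
    """Return {item: Pr[item is frequent in a random possible world]}.

    Enumerates all 2^n possible worlds, so it is only usable on tiny inputs.
    """
    prob = {item: 0.0 for item, _ in tuples}
    for mask in product([False, True], repeat=len(tuples)):
        world_prob = 1.0
        counts = Counter()
        for present, (item, p) in zip(mask, tuples):
            world_prob *= p if present else 1.0 - p
            if present:
                counts[item] += 1
        size = sum(counts.values())
        for item, c in counts.items():
            # Assumed definition: strictly more than a phi fraction of the world.
            if c > phi * size:
                prob[item] += world_prob
    return prob


if __name__ == "__main__":
    data = [("a", 1.0), ("b", 0.5), ("b", 0.5)]
    print(expected_frequency_ratio(data))
    print(frequent_probability(data, phi=0.5))
```

Under these assumptions, an item would be reported as likely frequent when its probability from `frequent_probability` exceeds a chosen confidence threshold, and sorting items by that probability gives the confidence-based ranking the abstract alludes to.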