Set similarity joins play an important role in many real-world applications, such as data cleaning, near-duplicate detection, and data integration. In these applications, set data often contain noise and are therefore uncertain and imprecise. In this paper, we model such probabilistic set data at two levels of uncertainty: the set level and the element level. Based on these models, we investigate the problem of probabilistic set similarity join (PS2J) over two probabilistic set databases under possible worlds semantics. To process the PS2J operator efficiently, we first reduce the problem by condensing the possible worlds, and then propose effective pruning techniques, including Jaccard distance pruning, probability upper bound pruning, and aggregate pruning, which filter out false alarms among probabilistic set pairs with the help of indexes and our designed synopses. Extensive experiments on both real and synthetic data demonstrate the PS2J processing performance.
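To make the possible worlds semantics concrete, the following is a minimal brute-force sketch of a join under element-level uncertainty. It is not the paper's algorithm: it enumerates every possible world instead of condensing them, and it uses no pruning, indexes, or synopses. The function names, the independence of elements, the distance threshold `gamma`, the probability threshold `alpha`, and the convention that two empty sets have distance 0 are all assumptions made for illustration.

```python
from itertools import product


def jaccard_distance(a, b):
    """Jaccard distance between two ordinary sets (assumed 0 if both are empty)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def possible_worlds(pset):
    """Yield (world, probability) for a probabilistic set.

    pset maps each element to its existence probability; elements are
    assumed independent (an illustrative assumption).
    """
    items = list(pset.items())
    for mask in product([False, True], repeat=len(items)):
        prob = 1.0
        world = set()
        for (elem, p), present in zip(items, mask):
            if present:
                prob *= p
                world.add(elem)
            else:
                prob *= 1.0 - p
        if prob > 0.0:  # skip worlds that cannot occur
            yield frozenset(world), prob


def match_probability(r, s, gamma):
    """Total probability of world pairs whose Jaccard distance is <= gamma."""
    total = 0.0
    for world_r, prob_r in possible_worlds(r):
        for world_s, prob_s in possible_worlds(s):
            if jaccard_distance(world_r, world_s) <= gamma:
                total += prob_r * prob_s
    return total


def naive_ps2j(R, S, gamma, alpha):
    """Return index pairs (i, j) whose match probability is at least alpha."""
    return [
        (i, j)
        for i, r in enumerate(R)
        for j, s in enumerate(S)
        if match_probability(r, s, gamma) >= alpha
    ]


if __name__ == "__main__":
    r = {"a": 1.0, "b": 0.5}
    s = {"a": 1.0, "b": 1.0}
    print(match_probability(r, s, gamma=0.4))
    print(naive_ps2j([r], [s], gamma=0.4, alpha=0.5))
```

The enumeration is exponential in the number of uncertain elements, which is the cost that condensing possible worlds and pruning candidate pairs are meant to avoid.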