Set similarity join on probabilistic data

Authors:
Xiang Lian;Lei Chen
Affiliations:
The Hong Kong University of Science and Technology, Kowloon, Hong Kong, China;The Hong Kong University of Science and Technology, Kowloon, Hong Kong, China
Venue:
Proceedings of the VLDB Endowment
Year:
2010

Citing 22
Cited 3

The impact of poor data quality on the typical enterprise

Communications of the ACM
M-tree: An Efficient Access Method for Similarity Search in Metric Spaces

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
MYSTIQ: a system for finding more answers by using probabilities

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
U-DBMS: a database system for managing constantly-evolving data

VLDB '05 Proceedings of the 31st international conference on Very large data bases
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Efficient exact set-similarity joins

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
ULDBs: databases with uncertainty and lineage

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Creating probabilistic databases from information extraction models

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Efficient join processing over uncertain data

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web
Efficient query evaluation on probabilistic databases

The VLDB Journal — The International Journal on Very Large Data Bases
Metric space similarity joins

ACM Transactions on Database Systems (TODS)
Efficient similarity joins for near duplicate detection

Proceedings of the 17th international conference on World Wide Web
MCDB: a monte carlo approach to managing uncertain data

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
BayesStore: managing large, uncertain data repositories with probabilistic graphical models

Proceedings of the VLDB Endowment
Data integration with uncertainty

The VLDB Journal — The International Journal on Very Large Data Bases
Top-k Spatial Joins of Probabilistic Objects

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Canopy closure estimates with GreenOrbs: sustainable sensing in the forest

Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems
Probabilistic string similarity joins

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Probabilistic similarity join on uncertain data

DASFAA'06 Proceedings of the 11th international conference on Database Systems for Advanced Applications

Efficient processing of probabilistic set-containment queries on uncertain set-valued data

Information Sciences: an International Journal
Mining probabilistically frequent sequential patterns in uncertain databases

Proceedings of the 15th International Conference on Extending Database Technology
Efficient processing of containment queries on nested sets

Proceedings of the 16th International Conference on Extending Database Technology

Quantified Score

Hi-index	0.00

Visualization

Abstract

Set similarity join has played an important role in many real-world applications such as data cleaning, near duplication detection, data integration, and so on. In these applications, set data often contain noises and are thus uncertain and imprecise. In this paper, we model such probabilistic set data on two uncertainty levels, that is, set and element levels. Based on them, we investigate the problem of probabilistic set similarity join (PS2J) over two probabilistic set databases, under the possible worlds semantics. To efficiently process the PS2J operator, we first reduce our problem by condensing the possible worlds, and then propose effective pruning techniques, including Jaccard distance pruning, probability upper bound pruning, and aggregate pruning, which can filter out false alarms of probabilistic set pairs, with the help of indexes and our designed synopses. We demonstrate through extensive experiments the PS2J processing performance on both real and synthetic data.