Creating probabilistic databases from duplicated data

  • Authors:
  • Oktie Hassanzadeh; Renée J. Miller

  • Affiliations:
  • Department of Computer Science, University of Toronto, Toronto, Canada (both authors)

  • Venue:
  • The VLDB Journal — The International Journal on Very Large Data Bases
  • Year:
  • 2009

Abstract

A major source of uncertainty in databases is the presence of duplicate items, i.e., records that refer to the same real-world entity. However, accurate deduplication is a difficult task, and imperfect data cleaning may result in the loss of valuable information. A reasonable alternative is to keep duplicates when the correct cleaning strategy is not certain, and to use an efficient probabilistic query-answering technique that returns query results along with the probability of each answer being correct. In this paper, we present a flexible, modular framework for scalably creating a probabilistic database from a dirty relation of duplicated data, and we survey the challenges that arise in applying this framework to large relations of string data. We study the problem of associating probabilities with duplicates that are detected using state-of-the-art scalable approximate join methods. We argue that standard thresholding techniques are not sufficiently robust for this task, and we propose new clustering algorithms suitable for inferring duplicates and their associated probabilities. We show that the inferred probabilities accurately reflect the error in duplicate records.
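
To make the pipeline in the abstract concrete (approximate-join similarity scores, clustering of likely duplicates, and probability assignment per cluster), the sketch below illustrates one possible end-to-end flow in Python. It is not the paper's algorithm: the q-gram tokenizer, the Jaccard measure, the 0.5 threshold, and the connected-components clustering are all illustrative assumptions, and the paper specifically argues for more robust clustering than simple thresholded linking. The output models each cluster as a set of mutually exclusive alternatives with weights that sum to 1.

```python
from collections import defaultdict
from itertools import combinations

def qgrams(s, q=3):
    """Character q-grams of a padded, lowercased string: a common
    tokenization for approximate string joins."""
    s = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a, b):
    """Jaccard similarity between two q-gram sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if inter else 0.0

def cluster_and_weight(records, threshold=0.5):
    """Group likely duplicates and attach a probability to each record.
    Clustering here is plain connected components over the similarity
    graph (a stand-in for the paper's more robust algorithms); each
    record's weight is its total similarity to the rest of its cluster,
    normalized so the cluster's weights sum to 1."""
    grams = [qgrams(r) for r in records]

    # Union-find over records whose similarity clears the threshold.
    parent = list(range(len(records)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    sim = {}
    for i, j in combinations(range(len(records)), 2):
        s = jaccard(grams[i], grams[j])
        sim[i, j] = s
        if s >= threshold:
            parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[find(i)].append(i)

    # Each cluster becomes a set of mutually exclusive alternatives;
    # more "central" records (higher within-cluster similarity) get
    # higher probability.
    result = []
    for members in clusters.values():
        if len(members) == 1:
            result.append([(records[members[0]], 1.0)])
            continue
        scores = {
            i: sum(sim[min(i, j), max(i, j)] for j in members if j != i)
            for i in members
        }
        total = sum(scores.values())
        if total == 0:  # degenerate tie: fall back to uniform weights
            result.append([(records[i], 1 / len(members)) for i in members])
        else:
            result.append([(records[i], scores[i] / total) for i in members])
    return result

# Hypothetical usage on a small dirty relation of name strings.
dirty = ["Renee J. Miller", "Renée J. Miller", "R. Miller", "Oktie Hassanzadeh"]
for alternatives in cluster_and_weight(dirty, threshold=0.4):
    print(alternatives)
```

Each printed group corresponds to one uncertain entity: its records are treated as alternatives whose weights can feed a probabilistic query-answering engine. A real implementation would replace the all-pairs loop with a scalable approximate join, as the abstract notes, since pairwise comparison is quadratic in the relation size.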