A major source of uncertainty in databases is the presence of duplicate items, i.e., records that refer to the same real-world entity. Accurate deduplication is difficult, however, and imperfect data cleaning may discard valuable information. A reasonable alternative is to keep the duplicates whenever the correct cleaning strategy is uncertain, and to use an efficient probabilistic query-answering technique that returns each query result together with the probability that it is correct. In this paper, we present a flexible, modular framework for scalably creating a probabilistic database from a dirty relation containing duplicated data, and we give an overview of the challenges that arise when applying this framework to large relations of string data. We study the problem of associating probabilities with duplicates that are detected using state-of-the-art scalable approximate join methods. We argue that standard thresholding techniques are not sufficiently robust for this task, and we propose new clustering algorithms suited to inferring duplicates and their associated probabilities. We show that the inferred probabilities accurately reflect the error in duplicate records.
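As a rough illustration of the idea of keeping duplicates and attaching probabilities to them, the following minimal sketch assumes the duplicates have already been grouped into clusters by some clustering step. It is not the algorithm proposed in the paper: the q-gram Jaccard similarity, the rule of normalizing each record's average similarity within its cluster, and the names `qgrams`, `jaccard`, `assign_probabilities`, and `answer_probability` are all assumptions made for this example.

```python
def qgrams(s, q=2):
    """Return the set of q-grams of a string, padded with '#' at both ends."""
    padded = f"#{s.lower()}#"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity between the q-gram sets of two strings."""
    ga, gb = qgrams(a), qgrams(b)
    return len(ga & gb) / len(ga | gb)


def assign_probabilities(clusters):
    """Map each cluster to a list of (record, probability) alternatives.

    clusters: dict of cluster id -> list of record strings.
    Assumed rule (illustrative only): a record's weight is its average
    similarity to the other records in its cluster, and the weights are
    normalized so that the probabilities in each cluster sum to 1.
    """
    result = {}
    for cid, records in clusters.items():
        if len(records) == 1:
            # A record with no detected duplicates is kept with certainty.
            result[cid] = [(records[0], 1.0)]
            continue
        weights = []
        for i, rec in enumerate(records):
            sims = [jaccard(rec, other)
                    for j, other in enumerate(records) if j != i]
            weights.append(sum(sims) / len(sims))
        total = sum(weights)
        if total == 0:
            # Fall back to a uniform distribution if nothing overlaps.
            weights, total = [1.0] * len(records), float(len(records))
        result[cid] = [(rec, w / total) for rec, w in zip(records, weights)]
    return result


def answer_probability(alternatives, predicate):
    """Probability that the cluster's true record satisfies the predicate,
    treating the alternatives in a cluster as mutually exclusive."""
    return sum(p for rec, p in alternatives if predicate(rec))


clusters = {
    "c1": ["Microsoft Corp", "Microsoft Corp.", "Mcrosoft Crop"],
    "c2": ["Oracle Inc"],
}
prob_db = assign_probabilities(clusters)
for cid, alternatives in prob_db.items():
    for rec, p in alternatives:
        print(f"{cid}\t{rec}\t{p:.3f}")
```

Under these assumptions, the misspelled record receives the lowest probability in its cluster, and a query can report how likely each answer is instead of depending on a single hard cleaning decision.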