Benchmarking declarative approximate selection predicates

Authors:
Amit Chandel;Oktie Hassanzadeh;Nick Koudas;Mohammad Sadoghi;Divesh Srivastava
Affiliations:
University of Toronto, Toronto, ON, Canada;University of Toronto, Toronto, ON, Canada;University of Toronto, Toronto, ON, Canada;University of Toronto, Toronto, ON, Canada;AT&T Labs-Research, Florham Park, NJ
Venue:
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Year:
2007


Abstract

Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is its ease of use and integration with existing applications. Over the last few years, several similarity predicates have been proposed for common quality primitives (approximate selections, joins, etc.) and have been fully expressed using declarative SQL statements. In this paper we propose new similarity predicates, along with their declarative realization, based on notions of probabilistic information retrieval. In particular, we show how language models and hidden Markov models can be utilized as similarity predicates for data quality, and we present their full declarative instantiation. We also show how other scoring methods from information retrieval can be utilized in a similar setting. We then present full declarative specifications of similarity predicates previously proposed in the literature, grouping them into classes according to their primary characteristics. Finally, we present a thorough performance and accuracy study comparing a large number of similarity predicates for data cleaning operations. We quantify both their runtime performance and their accuracy for several types of common quality problems encountered in operational databases.
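To make the idea of a declarative approximate selection concrete, the following is a minimal sketch, not the SQL given in the paper. It assumes a Jaccard predicate over padded 2-grams and uses SQLite through Python's `sqlite3` module. The table names (`base`, `base_tokens`, `query_tokens`), the padding characters, and the helper functions are all hypothetical choices made for illustration. Tokenization is done in Python for brevity, while the scoring and thresholding are expressed as a single SQL statement.

```python
import sqlite3


def qgrams(s, q=2):
    """Return the set of distinct q-grams of s, lowercased and padded."""
    padded = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build(conn, records, q=2):
    """Load (tid, name) records and their q-gram tokens into two tables."""
    conn.execute("CREATE TABLE base (tid INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE base_tokens (tid INTEGER, token TEXT)")
    conn.executemany("INSERT INTO base VALUES (?, ?)", records)
    conn.executemany(
        "INSERT INTO base_tokens VALUES (?, ?)",
        [(tid, tok) for tid, name in records for tok in qgrams(name, q)],
    )


# Jaccard = |intersection| / (|tuple tokens| + |query tokens| - |intersection|).
# The join on token yields the intersection; tuples sharing no token with the
# query never appear in the result.
JACCARD_SQL = """
SELECT name, score FROM (
    SELECT b.name AS name,
           1.0 * COUNT(*) /
           (s.size + (SELECT COUNT(*) FROM query_tokens) - COUNT(*)) AS score
    FROM base_tokens AS bt
    JOIN query_tokens AS qt ON qt.token = bt.token
    JOIN (SELECT tid, COUNT(*) AS size
          FROM base_tokens GROUP BY tid) AS s ON s.tid = bt.tid
    JOIN base AS b ON b.tid = bt.tid
    GROUP BY bt.tid, b.name, s.size
)
WHERE score >= :threshold
ORDER BY score DESC
"""


def approx_select(conn, query, threshold, q=2):
    """Return (name, score) rows whose Jaccard score meets the threshold."""
    conn.execute("DROP TABLE IF EXISTS query_tokens")
    conn.execute("CREATE TEMP TABLE query_tokens (token TEXT)")
    conn.executemany(
        "INSERT INTO query_tokens VALUES (?)",
        [(tok,) for tok in qgrams(query, q)],
    )
    return conn.execute(JACCARD_SQL, {"threshold": threshold}).fetchall()
```

A call such as `approx_select(conn, "ATT Labs", 0.4)` on a connection populated with `build` returns the stored names ranked by score. Other predicate classes (weighted, edit-based, or language-model scores) would change the token weights and the scoring expression rather than the overall join-and-aggregate shape, though their exact formulation in the paper is not reproduced here.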