Using similarity-based operations for resolving data-level conflicts

Authors:
Eike Schallehn; Kai-Uwe Sattler
Affiliations:
Department of Computer Science, University of Magdeburg, Magdeburg, Germany;Department of Computer Science, University of Magdeburg, Magdeburg, Germany
Venue:
BNCOD'03 Proceedings of the 20th British national conference on Databases
Year:
2003

Abstract

Dealing with discrepancies in data is still a major challenge in data integration systems. The problem occurs both when eliminating duplicates from semantically overlapping sources and when combining complementary data from different sources. Although SQL operations such as grouping and join seem to be a viable approach, they fail if the attribute values of the potential duplicates or related tuples are not equal but only similar according to certain criteria. As a solution to this problem, we present in this paper similarity-based variants of the grouping and join operators. The extended grouping operator produces groups of similar tuples; the extended join combines tuples satisfying a given similarity condition. We describe the semantics of these operators, discuss efficient implementations for edit distance similarity, and present evaluation results. Finally, we give examples of how the operators can be used in given application scenarios.
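
To make the two operators concrete, the following is a minimal Python sketch of what similarity-based join and grouping over edit distance could look like. It is an illustration only, not the authors' implementation: the function names `edit_distance`, `similarity_join`, and `similarity_group`, the dictionary representation of tuples, and the threshold parameter `max_dist` are all assumptions introduced here. The grouping variant assumes that groups are formed as the transitive closure of the pairwise similarity condition, and both operators use naive pairwise comparison rather than the efficient implementations the abstract refers to.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity_join(left, right, key, max_dist):
    """Naive nested-loop join: pair rows whose key values are within max_dist edits."""
    return [(l, r) for l in left for r in right
            if edit_distance(l[key], r[key]) <= max_dist]


def similarity_group(rows, key, max_dist):
    """Group rows by the transitive closure of pairwise similarity (assumed semantics)."""
    parent = list(range(len(rows)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(rows)), 2):
        if edit_distance(rows[i][key], rows[j][key]) <= max_dist:
            parent[find(i)] = find(j)

    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(find(i), []).append(row)
    return list(groups.values())


if __name__ == "__main__":
    people = [{"name": "Smith"}, {"name": "Smyth"}, {"name": "Jones"}]
    for group in similarity_group(people, "name", max_dist=1):
        print([row["name"] for row in group])
```

With a threshold of one edit, "Smith" and "Smyth" fall into the same group while "Jones" remains alone, which an equality-based GROUP BY would not achieve. The quadratic number of comparisons is the cost that an efficient implementation would need to reduce.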