Merging the results of approximate match operations

Authors:
Sudipto Guha;Nick Koudas;Amit Marathe;Divesh Srivastava
Affiliations:
U of Pennsylvania;AT&T Labs-Research;AT&T Labs-Research;AT&T Labs-Research
Venue:
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Year:
2004

Citing 12

The merge/purge problem for large databases
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data

Integration of heterogeneous databases without common domains using queries based on textual similarity
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data

Rank aggregation methods for the Web
Proceedings of the 10th international conference on World Wide Web

Optimal aggregation algorithms for middleware
PODS '01 Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Automatic segmentation of text into structured records
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data

Comparing top k lists
SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms

Approximate String Joins in a Database (Almost) for Free
Proceedings of the 27th International Conference on Very Large Data Bases

Interactive deduplication using active learning
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

Text joins in an RDBMS for web data integration
WWW '03 Proceedings of the 12th international conference on World Wide Web

Exploratory Data Mining and Data Cleaning
Exploratory Data Mining and Data Cleaning

Efficient similarity search and classification via rank aggregation
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

Robust and efficient fuzzy match for online data cleaning
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

Cited 26

RankSQL: query algebra and optimization for relational top-k queries
Proceedings of the 2005 ACM SIGMOD international conference on Management of data

Learning to deduplicate
Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries

Database support for matching: limitations and opportunities
Proceedings of the 2006 ACM SIGMOD international conference on Management of data

Duplicate Record Detection: A Survey
IEEE Transactions on Knowledge and Data Engineering

Leveraging semantic technologies for enterprise search
Proceedings of the ACM first Ph.D. workshop in CIKM

A strategy for allowing meaningful and comparable scores in approximate matching
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

Example-driven design of efficient record matching queries
VLDB '07 Proceedings of the 33rd international conference on Very large data bases

Rank-aware XML data model and algebra: towards unifying exact match and similar match in XML
MIV'07 Proceedings of the 7th Conference on 7th WSEAS International Conference on Multimedia, Internet & Video Technologies - Volume 7

Video linkage: group based copied video detection
CIVR '08 Proceedings of the 2008 international conference on Content-based image and video retrieval

Social recommendations of content and metadata
Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services

The impact of parameter setup on a genetic programming approach to record deduplication
SBBD '08 Proceedings of the 23rd Brazilian symposium on Databases

Semantics and evaluation of top-k queries in probabilistic databases
Distributed and Parallel Databases

Optimal Stopping: A Record-Linkage Approach
Journal of Data and Information Quality (JDIQ)

A strategy for allowing meaningful and comparable scores in approximate matching
Information Systems

A strategy for allowing meaningful and comparable scores in approximate matching
Information Systems

Frameworks for entity matching: A comparison
Data & Knowledge Engineering

Reasoning about record matching rules
Proceedings of the VLDB Endowment

XML: some papers in a haystack
ACM SIGMOD Record

Approximate entity extraction in temporal databases
World Wide Web

Ontology and instance matching
Knowledge-driven multimedia information extraction and ontology evolution

Privacy preserving group linkage
SSDBM'11 Proceedings of the 23rd international conference on Scientific and statistical database management

Dynamic constraints for record matching
The VLDB Journal — The International Journal on Very Large Data Bases

Efficient top-K approximate searches against a relation with multiple attributes
World Wide Web

Efficient processing of ranked queries with sweeping selection
PKDD'05 Proceedings of the 9th European conference on Principles and Practice of Knowledge Discovery in Databases

A machine learning approach for instance matching based on similarity metrics
ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part I

Comparing top-k XML lists
Information Systems

Abstract

Data cleaning is an important process that has been at the center of research interest in recent years. An important end goal of effective data cleaning is to identify the relational tuple or tuples that are "most related" to a given query tuple. Various techniques have been proposed in the literature for efficiently identifying approximate matches to a query string against a single attribute of a relation. In addition to constructing a ranking (i.e., ordering) of these matches, the techniques often associate with each match a score that quantifies the extent of the match. Since the query tuple may contain multiple attributes, issuing approximate match operations for each of them separately effectively creates a number of ranked lists of the relation tuples. Merging these lists to identify a final ranking and scoring, and returning the top-K tuples, is a challenging task. In this paper, we adapt the well-known footrule distance (for merging ranked lists) to effectively deal with scores. We study efficient algorithms to merge rankings, and produce the top-K tuples, in a declarative way. Since techniques for approximately matching a query string against a single attribute in a relation are typically best deployed in a database, we introduce and describe two novel algorithms for this problem and we provide SQL specifications for them. Our experimental case study, using real application data along with a realization of our proposed techniques on a commercial database system, highlights the benefits of the proposed algorithms and attests to the overall effectiveness and practicality of our approach.
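
As a rough illustration of the setting the abstract describes, the sketch below computes the classical (rank-only) footrule distance between two ranked lists and merges several per-attribute rankings into a top-K list. It is not the paper's method: the score-aware adaptation and the two SQL-based algorithms are not reproduced here, and the median-rank merge is a simple stand-in heuristic. The function names, tuple identifiers, and the three example rankings are all hypothetical.

```python
from statistics import median


def footrule_distance(ranking_a, ranking_b):
    """Classical footrule distance between two full rankings.

    Sums, over all items, the absolute difference between the item's
    position in ranking_a and its position in ranking_b.
    Assumption: both rankings are permutations of the same items.
    """
    if sorted(ranking_a) != sorted(ranking_b):
        raise ValueError("rankings must contain exactly the same items")
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(ranking_a))


def merge_top_k(rankings, k):
    """Merge rankings with a median-rank heuristic and return the top k.

    This is an illustrative stand-in, not the paper's algorithm.
    Assumption: every ranking orders the same set of items.
    """
    positions = {item: [] for item in rankings[0]}
    for ranking in rankings:
        for i, item in enumerate(ranking):
            positions[item].append(i)
    # Sort by median position; break ties by item id for determinism.
    ordered = sorted(positions, key=lambda it: (median(positions[it]), it))
    return ordered[:k]


# Hypothetical ranked lists, one per attribute of the query tuple.
name_ranking = ["t1", "t2", "t3", "t4"]
address_ranking = ["t2", "t1", "t4", "t3"]
phone_ranking = ["t1", "t3", "t2", "t4"]

print(footrule_distance(name_ranking, address_ranking))
print(merge_top_k([name_ranking, address_ranking, phone_ranking], 2))
```

Each of the three lists stands for the result of one single-attribute approximate match. A score-aware variant, as the abstract proposes, would use the match scores rather than positions alone.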