Information Systems
Data cleaning is an important process that has been at the center of research interest in recent years. An important end goal of effective data cleaning is to identify the relational tuple or tuples that are "most related" to a given query tuple. Various techniques have been proposed in the literature for efficiently identifying approximate matches to a query string against a single attribute of a relation. In addition to constructing a ranking (i.e., an ordering) of these matches, the techniques often associate with each match a score that quantifies the extent of the match. Since the query tuple may contain multiple attributes, issuing a separate approximate match operation for each attribute effectively creates several ranked lists of the relation's tuples. Merging these lists to identify a final ranking and scoring, and returning the top-K tuples, is a challenging task. In this paper, we adapt the well-known footrule distance (for merging ranked lists) to deal effectively with scores. We study efficient algorithms that merge rankings and produce the top-K tuples in a declarative way. Since techniques for approximately matching a query string against a single attribute of a relation are typically best deployed in a database, we introduce and describe two novel algorithms for this problem and provide SQL specifications for them. Our experimental case study, which uses real application data and a realization of our proposed techniques on a commercial database system, highlights the benefits of the proposed algorithms and attests to the overall effectiveness and practicality of our approach.
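To make the list-merging setting concrete, the following is a minimal sketch of the classical (rank-only) footrule distance between two rankings, together with a simple median-rank heuristic for picking the top-K tuples from several per-attribute rankings. It is an illustration under stated assumptions, not the score-aware adaptation or the SQL-based algorithms described in the abstract: the function names, the tuple identifiers `t1`, `t2`, `t3`, and the choice of median-rank aggregation are all hypothetical and chosen only for this example.

```python
from statistics import median


def footrule_distance(ranking_a, ranking_b):
    """Classical Spearman footrule distance between two full rankings.

    Each ranking is a list of the same items, best match first. The
    distance is the sum, over all items, of the absolute difference
    between the item's positions in the two lists.
    """
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    if set(ranking_a) != set(pos_b):
        raise ValueError("both rankings must contain the same items")
    return sum(abs(i - pos_b[item]) for i, item in enumerate(ranking_a))


def median_rank_top_k(rankings, k):
    """Heuristic merge (an assumption of this sketch): order items by
    their median position across all rankings and return the first k.
    Ties are broken by item identifier to keep the result deterministic.
    """
    positions = {item: [] for item in rankings[0]}
    for ranking in rankings:
        for i, item in enumerate(ranking):
            positions[item].append(i)
    ordered = sorted(positions, key=lambda item: (median(positions[item]), item))
    return ordered[:k]


# Hypothetical ranked lists, one per attribute of the query tuple.
by_name = ["t1", "t2", "t3"]
by_address = ["t2", "t1", "t3"]
by_phone = ["t1", "t3", "t2"]

print(footrule_distance(by_name, by_address))
print(median_rank_top_k([by_name, by_address, by_phone], 2))
```

The sketch uses only positions and ignores match scores, which is precisely the limitation that motivates adapting the footrule distance to scores.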