Approximate data matching aims to assess whether two distinct data instances represent the same real-world object. Data values are usually compared by applying a similarity function that returns a similarity score. If this score exceeds a given threshold, both data instances are considered to represent the same real-world object. These score values depend on the algorithm that implements the function and have no meaning to the user. In addition, score values generated by different functions are not comparable, which can cause problems when the scores returned by different similarity functions must be combined to compute the similarity between records. In this article, we propose that thresholds be defined in terms of the precision expected from the matching process rather than in terms of the raw scores returned by the similarity function. Precision is a widely known evaluation metric and has a clear interpretation from the user's point of view. Our approach defines mappings from score values to precision values, which we call adjusted scores. Obtaining such mappings requires training over a small dataset. Experiments show that training can be reused for different datasets in the same domain. Our results also demonstrate that existing methods for combining scores to compute the similarity between records may be enhanced if adjusted scores are used.
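
To make the idea of a score-to-precision mapping concrete, the following is a minimal sketch and not the article's actual procedure. It assumes the training data is a list of `(score, is_match)` pairs labeled by hand, and it takes the precision at a threshold to be the fraction of true matches among training pairs whose score is at or above that threshold. The function names `build_adjusted_score_map` and `adjusted_score`, and the rule used for scores that fall between or below training thresholds, are hypothetical choices made for illustration.

```python
from bisect import bisect_right


def build_adjusted_score_map(training):
    """Build a list of (threshold, precision) pairs, ascending by threshold.

    `training` is a list of (score, is_match) pairs from a labeled sample
    (an assumed input format). Precision at a threshold is the fraction of
    true matches among all training pairs with score >= threshold.
    """
    ordered = sorted(training, key=lambda pair: pair[0], reverse=True)
    mapping = []
    true_matches = 0
    seen = 0
    i = 0
    while i < len(ordered):
        threshold = ordered[i][0]
        # Consume every pair tied at this score before computing precision.
        while i < len(ordered) and ordered[i][0] == threshold:
            true_matches += int(ordered[i][1])
            seen += 1
            i += 1
        mapping.append((threshold, true_matches / seen))
    mapping.reverse()
    return mapping


def adjusted_score(mapping, score):
    """Return the precision recorded for the largest threshold <= score.

    Assumption: a score below every training threshold gets the precision
    of the lowest threshold, i.e. the precision over the whole sample.
    """
    thresholds = [t for t, _ in mapping]
    idx = max(bisect_right(thresholds, score) - 1, 0)
    return mapping[idx][1]


# Hypothetical labeled sample for one similarity function.
sample = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
mapping = build_adjusted_score_map(sample)
print(adjusted_score(mapping, 0.85))
print(adjusted_score(mapping, 0.65))
```

Under these assumptions, a user could request "matches with at least 0.75 precision" instead of choosing a raw score cutoff, and adjusted scores from different similarity functions would share one scale before being combined. Note that the empirical precision computed this way is not guaranteed to grow monotonically with the threshold, so a practical version might smooth the mapping.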