Information Sciences: an International Journal
Several advanced data management applications, such as data integration, data deduplication, and similarity querying, rely on the application of similarity functions. A similarity function requires the definition of a threshold value in order to decide whether two different data instances match, i.e., whether they represent the same real-world object. In this context, threshold definition is a central problem. This paper proposes a method for estimating the quality of a similarity function. Quality is measured in terms of recall and precision calculated at several different thresholds. Based on the results of the proposed estimation process and the requirements of a specific application, a user can choose a suitable threshold value. The estimation process is based on a clustering phase performed over a data collection (or a sample thereof) and requires no human intervention, since the choice of similarity threshold is based on the silhouette coefficient, which is an internal quality measure for clusters. An extensive set of experiments on artificial and real datasets demonstrates the effectiveness of the proposed approach. The results of the experiments show that in most cases the estimation error was below 10% in terms of precision and recall.
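To make the idea concrete, the following is a minimal Python sketch of one possible reading of the process described above; it is not the paper's algorithm. It assumes a token-set Jaccard similarity, single-link clustering (connected components of pairs at or above the threshold), and distance defined as 1 - similarity. The function names (jaccard, cluster, silhouette, estimate_quality) are hypothetical. The clustering with the highest mean silhouette coefficient is treated as the reference, and precision and recall are then estimated against it at each candidate threshold.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set Jaccard similarity; an assumed stand-in for any similarity function."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def cluster(items, sim, threshold):
    """Single-link clustering: connected components of pairs with sim >= threshold."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if sim(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(items))]


def silhouette(items, labels, sim):
    """Mean silhouette coefficient, with distance taken as 1 - similarity."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        return 0.0  # undefined for a single cluster; 0.0 is used here by convention

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(1.0 - sim(items[i], items[j]) for j in others) / len(others)

    total = 0.0
    for i, lab in enumerate(labels):
        if len(clusters[lab]) == 1:
            continue  # singleton clusters contribute s(i) = 0
        a = mean_dist(i, clusters[lab])  # cohesion within own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(items)


def estimate_quality(items, sim, thresholds):
    """Pick the threshold with the best silhouette, then estimate precision and
    recall at every threshold against that reference clustering."""
    best = max(thresholds, key=lambda t: silhouette(items, cluster(items, sim, t), sim))
    reference = cluster(items, sim, best)
    pairs = list(combinations(range(len(items)), 2))
    relevant = {p for p in pairs if reference[p[0]] == reference[p[1]]}
    report = {}
    for t in thresholds:
        retrieved = {p for p in pairs if sim(items[p[0]], items[p[1]]) >= t}
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 1.0
        recall = hits / len(relevant) if relevant else 1.0
        report[t] = (precision, recall)
    return best, report
```

Calling estimate_quality(records, jaccard, [0.0, 0.5, 0.9]) on a small list of strings returns the selected threshold and a mapping from each threshold to an estimated (precision, recall) pair, from which a user could pick the trade-off that suits the application. The pairwise loops are quadratic in the number of records, so in practice this sketch would be run on a sample.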