Machine Learning
IEEE Transactions on Pattern Analysis and Machine Intelligence
Automating the approximate record-matching process
Information Sciences—Informatics and Computer Science: An International Journal
A guided tour to approximate string matching
ACM Computing Surveys (CSUR)
Learning object identification rules for information integration
Information Systems - Data extraction, cleaning and reconciliation
Information Sciences: an International Journal
Interactive deduplication using active learning
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning domain-independent string transformation weights for high accuracy object identification
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Adaptive duplicate detection using learnable string similarity measures
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Automatically utilizing secondary sources to align information across sources
AI Magazine - Special issue on semantic integration
Duplicate Record Detection: A Survey
IEEE Transactions on Knowledge and Data Engineering
Introduction to Semi-Supervised Learning
Introduction to Semi-Supervised Learning
On active learning of record matching packages
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Controlling false match rates in record linkage using extreme value theory
Journal of Biomedical Informatics
Introduction: Supervised record linkage methods often require clerical review to obtain informative training data. Active learning prompts the user to label data items with particular characteristics in order to minimise review costs. We conducted an empirical evaluation to investigate whether a simple active learning strategy using binary comparison patterns is sufficient, or whether string metrics together with a more sophisticated algorithm are necessary to achieve high accuracy with a small training set.

Material and Methods: Based on medical registry data with different numbers of attributes, we used active learning to acquire training sets for classification trees, which were then used to classify the remaining data. Active learning for binary patterns means that every distinct comparison pattern represents a stratum from which one item is sampled. Active learning for patterns consisting of Levenshtein string metric values uses an iterative process in which the most informative and representative examples are added to the training set. For this purpose, we extended the active learning strategy of Sarawagi and Bhamidipaty (2002) [6].

Results: On the original data set, active learning based on binary comparison patterns leads to the best results. When four or six attributes are dropped, using string metrics leads to better results. In both cases, no more than 200 manually reviewed training examples are necessary.

Conclusions: In record linkage applications where only forename, name and birthday are available as attributes, we suggest the sophisticated active learning strategy based on string metrics in order to achieve highly accurate results. We recommend the simple strategy if more attributes are available, as in our study. In both cases, active learning significantly reduces the amount of manual involvement in training data selection compared with usual record linkage settings.
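
The simple strategy described under Material and Methods, in which every distinct binary comparison pattern forms a stratum and one item is sampled from it, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the exact-equality comparison, the fixed random seed and the toy records are all assumptions made for illustration only.

```python
import itertools
import random
from collections import defaultdict

# Hypothetical attribute set; the abstract names forename, name and birthday.
ATTRIBUTES = ("forename", "name", "birthday")


def binary_pattern(rec_a, rec_b, attributes):
    """Return one 0/1 flag per attribute: 1 if both values agree exactly."""
    return tuple(int(rec_a[attr] == rec_b[attr]) for attr in attributes)


def sample_one_per_pattern(pairs, attributes, seed=0):
    """Group record pairs by comparison pattern and draw one pair per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec_a, rec_b in pairs:
        strata[binary_pattern(rec_a, rec_b, attributes)].append((rec_a, rec_b))
    # One randomly chosen pair from every stratum goes to manual review.
    return {pattern: rng.choice(members) for pattern, members in strata.items()}


# Invented toy records, not taken from the registry data used in the study.
records = [
    {"forename": "Anna", "name": "Meyer", "birthday": "1970-01-01"},
    {"forename": "Anna", "name": "Meier", "birthday": "1970-01-01"},
    {"forename": "Anne", "name": "Meyer", "birthday": "1970-01-01"},
    {"forename": "Peter", "name": "Schmidt", "birthday": "1985-05-23"},
]

pairs = list(itertools.combinations(records, 2))
to_review = sample_one_per_pattern(pairs, ATTRIBUTES)

for pattern, (rec_a, rec_b) in sorted(to_review.items()):
    print(pattern, rec_a["forename"], rec_a["name"], "|", rec_b["forename"], rec_b["name"])
```

In this sketch the number of pairs sent to review equals the number of distinct patterns, so it is bounded by two to the power of the number of attributes regardless of how many record pairs exist. The labelled pairs could then serve as a training set for a classifier such as a classification tree, as in the study.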