Interactive deduplication using active learning
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning domain-independent string transformation weights for high accuracy object identification
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning to match and cluster large high-dimensional data sets for data integration
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
TAILOR: A Record Linkage Tool Box
ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Adaptive duplicate detection using learnable string similarity measures
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Text classification from positive and unlabeled documents
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Methods for evaluating and creating data quality
Information Systems - Special issue: Data quality in cooperative information systems
Duplicate Record Detection: A Survey
IEEE Transactions on Knowledge and Data Engineering
Collective entity resolution in relational data
ACM Transactions on Knowledge Discovery from Data (TKDD)
Towards automated record linkage
AusDM '06 Proceedings of the fifth Australasian conference on Data mining and analytics - Volume 61
A two-step classification approach to unsupervised record linkage
AusDM '07 Proceedings of the sixth Australasian conference on Data mining and analytics - Volume 70
Febrl: a freely available record linkage system with a graphical user interface
HDKM '08 Proceedings of the second Australasian workshop on Health data and knowledge management - Volume 80
Automatic training example selection for scalable unsupervised record linkage
PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Probabilistic data generation for deduplication and data linkage
IDEAL'05 Proceedings of the 6th international conference on Intelligent Data Engineering and Automated Learning
Decision models for record linkage
Data Mining
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Geocode Matching and Privacy Preservation
Privacy, Security, and Trust in KDD
Generic Entity Resolution in Relational Databases
ADBIS '09 Proceedings of the 13th East European Conference on Advances in Databases and Information Systems
Privacy and anonymization for very large datasets
Proceedings of the 18th ACM conference on Information and knowledge management
ACM SIGKDD Explorations Newsletter
A community question-answering refinement system
Proceedings of the 22nd ACM conference on Hypertext and hypermedia
Ontology and instance matching
Knowledge-driven multimedia information extraction and ontology evolution
Fusion of similarity measures for time series classification
HAIS'11 Proceedings of the 6th international conference on Hybrid artificial intelligent systems - Volume Part II
Personalized book recommendations created by using social media data
WISS'10 Proceedings of the 2010 international conference on Web information systems engineering
Computer-based genealogy reconstruction in founder populations
Journal of Biomedical Informatics
Multiple valued logic approach for matching patient records in multiple databases
Journal of Biomedical Informatics
Flexible and efficient distributed resolution of large entities
FoIKS'12 Proceedings of the 7th international conference on Foundations of Information and Knowledge Systems
Automatic SLA Matching and Provider Selection in Grid and Cloud Computing Markets
GRID '12 Proceedings of the 2012 ACM/IEEE 13th International Conference on Grid Computing
A group recommender for movies based on content similarity and popularity
Information Processing and Management: an International Journal
A taxonomy of privacy-preserving record linkage techniques
Information Systems
Data & Knowledge Engineering
The task of linking databases is an important step in an increasing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregate all records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of record pairs into matches and non-matches. While classification was traditionally based on manually set thresholds or on statistical procedures, many of the more recently developed classification methods are based on supervised learning techniques. They therefore require training data, which is often not available in real-world situations or has to be prepared manually, an expensive, cumbersome and time-consuming process. The author has previously presented a novel two-step approach to automatic record pair classification [6, 7]. In the first step of this approach, training examples of high quality are automatically selected from the compared record pairs; in the second step, these examples are used to train a support vector machine (SVM) classifier. Initial experiments showed the feasibility of the approach, achieving results that outperformed k-means clustering. In this paper, two variations of this approach are presented. The first is based on a nearest-neighbour classifier, while the second improves an SVM classifier by iteratively adding more examples into the training sets. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.
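The two-step idea described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: it assumes each compared record pair has already been turned into a vector of per-field similarities in [0, 1], the function names and the thresholds 0.9 and 0.1 are hypothetical, and a plain one-nearest-neighbour rule stands in for the classifiers evaluated in the paper.

```python
from math import dist

def select_seeds(vectors, match_thresh=0.9, non_match_thresh=0.1):
    """Step 1: automatically pick high-confidence training examples.

    A vector becomes a match seed if every field similarity is at least
    match_thresh, and a non-match seed if every field similarity is at
    most non_match_thresh. Both thresholds are illustrative assumptions.
    """
    match_seeds = [v for v in vectors if min(v) >= match_thresh]
    non_match_seeds = [v for v in vectors if max(v) <= non_match_thresh]
    return match_seeds, non_match_seeds

def classify_nearest_seed(vectors, match_seeds, non_match_seeds):
    """Step 2: label every vector by the class of its nearest seed."""
    labels = []
    for v in vectors:
        d_match = min(dist(v, s) for s in match_seeds)
        d_non_match = min(dist(v, s) for s in non_match_seeds)
        labels.append("match" if d_match <= d_non_match else "non-match")
    return labels

# Made-up similarity vectors for four compared record pairs.
weight_vectors = [
    (1.0, 0.95, 1.0),   # agrees on nearly every field
    (0.0, 0.05, 0.1),   # disagrees on nearly every field
    (0.9, 0.6, 0.8),    # ambiguous, closer to the match seed
    (0.2, 0.3, 0.0),    # ambiguous, closer to the non-match seed
]

match_seeds, non_match_seeds = select_seeds(weight_vectors)
labels = classify_nearest_seed(weight_vectors, match_seeds, non_match_seeds)
```

In this sketch no manually labelled data is needed: only the clearest pairs are used as training examples, and the ambiguous pairs in between are then classified relative to them. The sketch assumes at least one seed of each kind is found; with real data the thresholds would need tuning to guarantee that.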