Bioinformatics
This paper describes a software tool that reconstructs entire genealogies from data collected from different, heterogeneous sources, including municipal and parish records archived over centuries. The tool uses a record linkage algorithm based on rule-based data matching, and it applies a general strategy for managing the ambiguities caused by missing, imprecise, or erroneous input data. The process is iterative: it combines automatic pedigree reconstruction with software-assisted human data revision, both to improve the quality and accuracy of the results and to optimize the matching rules. The paper discusses the results obtained by reconstructing the entire genealogy of the population of the Val Borbera, a geographically isolated valley in Northern Italy. The genealogy was reconstructed from data going back as far as the 16th century. The resulting pedigree includes 75,994 trios, 58.9% of which belong to a single large family reconstructed over 13 generations.
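To make the idea of rule-based matching with tolerance for missing data concrete, the following is a minimal sketch in Python. It is not the tool's actual algorithm or rule set, which the abstract does not specify. The `Person` type, the name and birth-year rules, the two-year tolerance, and the function names are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Person:
    """Hypothetical minimal record; real archival records carry many more fields."""
    given: str
    surname: str
    birth_year: Optional[int] = None  # None models a missing value


def names_agree(a: str, b: str) -> bool:
    # Assumed rule: names match if equal after trimming and case folding.
    return a.strip().lower() == b.strip().lower()


def years_agree(a: Optional[int], b: Optional[int], tolerance: int = 2) -> bool:
    # Assumed rule: a missing year is compatible with anything;
    # otherwise the years must lie within an arbitrary tolerance.
    if a is None or b is None:
        return True
    return abs(a - b) <= tolerance


def same_person(p: Person, q: Person) -> bool:
    # A record pair is linked only if every rule agrees.
    return (
        names_agree(p.given, q.given)
        and names_agree(p.surname, q.surname)
        and years_agree(p.birth_year, q.birth_year)
    )


def link(record: Person, candidates: List[Person]) -> List[Person]:
    # Return every compatible candidate. More than one result is an
    # ambiguity that, in an iterative workflow, would go to human review.
    return [c for c in candidates if same_person(record, c)]
```

In this sketch, a record that matches several candidates is not resolved automatically. Returning all compatible candidates leaves room for the human revision step the abstract describes, after which the rules (here, the tolerance and the name comparison) could be tightened or relaxed.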