A Fast Linkage Detection Scheme for Multi-Source Information Integration

Authors:
Akiko Aizawa;Keizo Oyama
Affiliations:
National Institute of Informatics, The Graduate University for Advanced Studies, Hitotsubashi, Chiyoda-ku, Japan;National Institute of Informatics, The Graduate University for Advanced Studies, Hitotsubashi, Chiyoda-ku, Japan
Venue:
WIRI '05 Proceedings of the International Workshop on Challenges in Web Information Retrieval and Integration
Year:
2005

Abstract

Record linkage refers to techniques for identifying records that correspond to the same real-world entity. It is crucial for integrating multi-source databases that were generated independently, and it is also considered a key issue in integrating heterogeneous Web resources. For large-scale data, however, the cost of enumerating all possible linkages is often impractically high. This paper therefore proposes a fast and efficient method for linkage detection. The proposed approach has two features. First, it exploits a suffix array structure that enables linkage detection using variable-length n-grams. Second, it dynamically generates blocks of possibly associated records using 'blocking keys' extracted from already known, reliable linkages. The paper also reports results from preliminary experiments in which the proposed method was applied to the integration of four bibliographic databases, which scale up to more than 10 million records.
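
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It builds a naive suffix array over a few made-up titles, derives a blocking key from one pair assumed to be a known linkage, and then collects the block of records that contain that key. The function names (`build_suffix_array`, `records_containing`, `blocking_key`), the sample records, and the use of the longest common substring as the key are all illustrative choices, not the authors' method.

```python
from bisect import bisect_left
from difflib import SequenceMatcher


def build_suffix_array(records):
    """Return a sorted list of (suffix, record_id) pairs over all records.

    Naive construction: it stores every suffix explicitly, which is fine for
    a toy example but far too costly for millions of records.
    """
    entries = []
    for rid, text in enumerate(records):
        for start in range(len(text)):
            entries.append((text[start:], rid))
    entries.sort()
    return entries


def records_containing(suffix_array, key):
    """Return the ids of records that contain `key` as a substring.

    Because `key` may have any length, this acts as a variable-length
    n-gram lookup: all suffixes starting with `key` are contiguous in the
    sorted array, so a binary search finds where the run begins.
    """
    lo = bisect_left(suffix_array, (key, -1))
    ids = set()
    for suffix, rid in suffix_array[lo:]:
        if not suffix.startswith(key):
            break
        ids.add(rid)
    return ids


def blocking_key(text_a, text_b):
    """Assumed key extraction: the longest common substring of a linked pair."""
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    match = matcher.find_longest_match(0, len(text_a), 0, len(text_b))
    return text_a[match.a:match.a + match.size].strip()


# Hypothetical records; pair (0, 1) is assumed to be an already known linkage.
records = [
    "record linkage using suffix arrays",
    "robust record linkage blocking",
    "adaptive methods for record linkage",
    "suffix trees for genome search",
]

sa = build_suffix_array(records)
key = blocking_key(records[0], records[1])
block = records_containing(sa, key)
print(key, sorted(block))
```

Only the records inside `block` would then be compared pairwise, which is what keeps the number of candidate linkages below the full quadratic enumeration. A production system would replace the naive suffix list with a compact suffix array over a concatenated text.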