This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over the individual attributes constituting the record. The record-linkage problem arises naturally in the context of data cleansing, which usually precedes data analysis and mining. We explore a novel approach to this problem. For each attribute of the records, we first map values to a multidimensional Euclidean space that preserves domain-specific similarity. Many mapping algorithms can be applied, and we use the FastMap approach as an example. Given the merging rule that defines when two records are similar, a set of attributes is chosen along which the merge will proceed. A multidimensional similarity join over the chosen attributes is used to determine similar pairs of records. Our extensive experiments using real data sets show that our solution has very good efficiency and accuracy.
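The two steps described above (embed attribute values in a Euclidean space, then join on proximity) can be illustrated with a small sketch. It is not the paper's implementation. It assumes a single string attribute, edit distance as the domain-specific similarity, and a brute-force nested loop standing in for an indexed multidimensional similarity join. The function names and the parameters `k`, `eps`, and `threshold` are hypothetical, chosen only for this illustration. Because edit distance is not Euclidean, the embedding is approximate and the proximity filter may miss some true matches.

```python
import math


def edit_distance(s, t):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def fastmap(objects, dist, k):
    """Map each object to a k-dimensional point using the FastMap projection."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def resid_sq(i, j, dim):
        # Squared distance left over after the first `dim` coordinates.
        d2 = dist(objects[i], objects[j]) ** 2
        for c in range(dim):
            d2 -= (coords[i][c] - coords[j][c]) ** 2
        return max(d2, 0.0)  # clamp: a non-Euclidean metric can go negative

    for dim in range(k):
        # Pivot heuristic: start anywhere, jump to the farthest object twice.
        a = 0
        b = max(range(n), key=lambda j: resid_sq(a, j, dim))
        a = max(range(n), key=lambda j: resid_sq(b, j, dim))
        dab2 = resid_sq(a, b, dim)
        if dab2 == 0.0:
            break  # no residual distance left; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Projection of object i onto the line through pivots a and b.
            coords[i][dim] = (resid_sq(a, i, dim) + dab2 - resid_sq(b, i, dim)) / (2 * dab)
    return coords


def similarity_join(left, right, dist, k=2, eps=2.0, threshold=2):
    """Filter pairs by Euclidean proximity, then verify with the real metric."""
    coords = fastmap(left + right, dist, k)
    lc, rc = coords[:len(left)], coords[len(left):]
    matches = []
    for i, p in enumerate(lc):
        for j, q in enumerate(rc):
            if math.dist(p, q) <= eps and dist(left[i], right[j]) <= threshold:
                matches.append((left[i], right[j]))
    return matches


if __name__ == "__main__":
    left = ["John Smith", "Mary Jones"]
    right = ["John Smith", "Alice Brown"]
    print(similarity_join(left, right, edit_distance))
```

The verification step against the original distance keeps false positives out of the result. In a larger setting, the nested loop would be replaced by a spatial index or a grid-based join over the embedded points, and records with several attributes would be embedded attribute by attribute before joining on the attributes selected by the merging rule.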