Iterative record linkage for cleaning and integration

Authors:
Indrajit Bhattacharya;Lise Getoor
Affiliations:
University of Maryland, College Park, MD;University of Maryland, College Park, MD
Venue:
Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Year:
2004

Abstract

Record linkage, the problem of determining when two records refer to the same entity, has applications in both data cleaning (deduplication) and the integration of data from multiple sources. Traditional approaches use a similarity measure that compares tuples' attribute values; tuples whose similarity scores exceed a given threshold are declared matches. This method can perform quite well in many domains, particularly those where the data contain little noise, but in some domains looking only at tuple values is not enough. By also examining the context of a tuple, i.e., the other tuples to which it is linked, we can reach a more accurate linkage decision. This additional accuracy comes at a price: to find all duplicates correctly, we may need to make multiple passes over the data, because linkages, as they are discovered, may in turn allow us to discover additional linkages. We present results that illustrate the power and feasibility of using join information when comparing records.
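
The multi-pass idea in the abstract can be illustrated with a small Python sketch. This is not the authors' algorithm; it is a minimal illustration under stated assumptions. The toy `name_similarity` function, the weight `alpha`, the `threshold`, the Jaccard-based context score, and the example records are all hypothetical choices made for this sketch. Each record carries a name and a set of linked records (for example, co-author references on the same paper), and the context score compares the entities that two records are currently linked to.

```python
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Toy attribute similarity (an assumption of this sketch): 1.0 for
    identical names, 0.5 when surname and first initial agree, else 0.0."""
    if a == b:
        return 1.0
    ta, tb = a.lower().split(), b.lower().split()
    if ta[-1] == tb[-1] and ta[0][0] == tb[0][0]:
        return 0.5
    return 0.0


def jaccard(x: set, y: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def iterative_linkage(names, links, alpha=0.6, threshold=0.55):
    """Return (clusters, passes). links[i] is the set of record ids that
    record i is linked to. alpha and threshold are illustrative values."""
    parent = list(range(len(names)))  # union-find forest over records

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    passes = 0
    while True:
        passes += 1
        # Snapshot the current entity label of every record, so that all
        # decisions in this pass use the linkages found in earlier passes.
        label = [find(i) for i in range(len(names))]
        new_matches = []
        for i, j in combinations(range(len(names)), 2):
            if label[i] == label[j]:
                continue  # already linked
            context_i = {label[k] for k in links[i]}
            context_j = {label[k] for k in links[j]}
            score = (alpha * name_similarity(names[i], names[j])
                     + (1 - alpha) * jaccard(context_i, context_j))
            if score >= threshold:
                new_matches.append((i, j))
        if not new_matches:
            break  # a full pass found nothing new: stop
        for i, j in new_matches:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(names)):
        clusters.setdefault(find(i), set()).add(i)
    return sorted(clusters.values(), key=min), passes


# Hypothetical author references; records on the same paper link to each other.
names = ["A. Aho", "J. Ullman", "Alfred Aho", "J. Ullman", "Anna Aho", "M. Lee"]
links = [{1}, {0}, {3}, {2}, {5}, {4}]
clusters, passes = iterative_linkage(names, links)
print(clusters, passes)
```

On this toy input, the first pass links only the two identical "J. Ullman" references. That linkage gives "A. Aho" and "Alfred Aho" a shared neighbour, so the second pass links them, while "Anna Aho", whose co-author is different, stays separate. A third pass finds nothing new and the loop stops. With attribute similarity alone, none of the three Aho references would reach the threshold.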