Domain-independent data cleaning via analysis of entity-relationship graph

Authors:
Dmitri V. Kalashnikov;Sharad Mehrotra
Affiliations:
University of California, Irvine, Irvine, CA;University of California, Irvine, Irvine, CA
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2006

Abstract

In this article, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RelDC) and traditional techniques is that RelDC analyzes not only object features but also inter-object relationships to improve disambiguation quality. Our extensive experiments over two real datasets and over synthetic datasets show that analysis of relationships significantly improves the quality of the result.
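
To make the idea of using relationships concrete, the following is a minimal toy sketch, not the RelDC algorithm itself. It assumes that a feature-based step (for example, name matching) has already narrowed a reference down to a few candidate entities, and it then picks the candidate most strongly connected to the reference's context in an entity-relationship graph. The scoring rule (each simple path of L edges contributes 1/L), the `max_len` cutoff, and all names and data below are hypothetical choices made for illustration only.

```python
def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def connection_strength(graph, source, target, max_len=3):
    """Sum 1/L over all simple paths of L <= max_len edges from source to target.

    This weighting is an illustrative assumption, not the model from the article.
    """
    total = 0.0
    stack = [(source, (source,))]
    while stack:
        node, path = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt in path:
                continue  # keep paths simple (no repeated nodes)
            length = len(path)  # number of edges needed to reach nxt
            if nxt == target:
                total += 1.0 / length
            elif length < max_len:
                stack.append((nxt, path + (nxt,)))
    return total


def disambiguate(graph, context, candidates, max_len=3):
    """Return the candidate entity with the strongest connection to the context."""
    return max(
        candidates,
        key=lambda c: connection_strength(graph, context, c, max_len),
    )


# Hypothetical data: paper P1 lists an author "J. Smith", which feature
# matching alone cannot resolve between two entities with similar names.
graph = build_graph([
    ("P1", "Alice"),        # Alice is a known author of P1
    ("Alice", "UCI"),       # Alice is affiliated with UCI
    ("Jane Smith", "UCI"),  # Jane Smith is affiliated with UCI
    ("Jane Smith", "P2"),   # Jane Smith wrote P2 ...
    ("Alice", "P2"),        # ... together with Alice
    ("John Smith", "MIT"),  # John Smith has no link to P1's context
])

best = disambiguate(graph, "P1", ["John Smith", "Jane Smith"])
print(best)
```

In this toy graph the two candidates are indistinguishable by name alone, but only one of them is linked to the paper through a shared co-author and affiliation, which is the kind of evidence that relationship analysis adds on top of feature comparison.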