Data quality problems can arise from abbreviations, data entry mistakes, duplicate records, missing fields, and many other sources. Data-cleaning research has focused largely on duplicate elimination, or the merge/purge problem. Another problem is spurious links: erroneous data in which a real-world entity is linked to multiple records, some of which might not properly belong to it. One approach to this problem uses context information to clean up the spurious links. It first identifies and retrieves the data containing potential spurious links, then compares the context of those records to find the ones with high overlap. The degree of context overlap indicates the likelihood of a spurious link. Experiments on three real-world data sets demonstrate that this approach can correctly identify spurious links and thus assist data cleaning.
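
The context comparison step described above can be illustrated with a minimal sketch. The abstract does not name the similarity measure, the kind of context used, or the decision rule, so everything specific here is an assumption: context is modeled as a set of co-author names, overlap is measured with the Jaccard coefficient, and records are ranked by their mean overlap with the other records linked to the same entity. The function names `jaccard` and `context_overlap_scores` and the sample records are hypothetical, and treating the lowest-overlap record as the most suspicious one is one possible reading, not necessarily the authors' rule.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two context sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def context_overlap_scores(records: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """For each record linked to one entity, average its context overlap
    with the entity's other records; return lowest-overlap records first."""
    scores = []
    for rid, ctx in records.items():
        others = [c for other, c in records.items() if other != rid]
        mean = sum(jaccard(ctx, c) for c in others) / len(others) if others else 0.0
        scores.append((rid, mean))
    return sorted(scores, key=lambda pair: pair[1])


# Hypothetical records linked to one author name; context = co-author names.
linked = {
    "paper1": {"lee", "hsu"},
    "paper2": {"lee", "hsu", "tan"},
    "paper3": {"hsu", "tan"},
    "paper4": {"garcia", "muller"},
}
ranked = context_overlap_scores(linked)
for record_id, score in ranked:
    print(f"{record_id}: mean context overlap = {score:.3f}")
```

In this toy data, `paper4` shares no co-authors with the other three records, so it comes first in the ranking with a mean overlap of 0. A real system would still need a threshold or a manual review step to decide which ranked records to unlink.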