From information to knowledge: harvesting entities and relationships from web sources
Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Entity Resolution and Information Quality
Entity Resolution and Information Quality
SemGen: towards a semantic data generator for benchmarking duplicate detectors
DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications
Learning phenotype mapping for integrating large genetic data
BioNLP '11 Proceedings of BioNLP 2011 Workshop
A supervised machine learning approach for duplicate detection over gazetteer records
GeoS'11 Proceedings of the 4th international conference on GeoSpatial semantics
Query relaxation for entity-relationship search
ESWC'11 Proceedings of the 8th Extended Semantic Web Conference on The Semantic Web: Research and Applications - Volume Part II
Detecting bug duplicate reports through local references
Proceedings of the 7th International Conference on Predictive Models in Software Engineering
Efficient duplicate detection on cloud using a new signature scheme
WAIM'11 Proceedings of the 12th international conference on Web-age information management
Efficient similarity search: arbitrary similarity measures, arbitrary composition
Proceedings of the 20th ACM international conference on Information and knowledge management
Black swan: augmenting statistics with event data
Proceedings of the 20th ACM international conference on Information and knowledge management
Instance-based 'one-to-some' assignment of similarity measures to attributes
OTM'11 Proceedings of the 2011 Confederated International Conference on On the Move to Meaningful Internet Systems - Volume Part I
The impact of spelling errors on patent search
EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
An effective rule miner for instance matching in a web of data
Proceedings of the 21st ACM international conference on Information and knowledge management
Evaluating indeterministic duplicate detection results
SUM'12 Proceedings of the 6th international conference on Scalable Uncertainty Management
Indeterministic Handling of Uncertain Decisions in Deduplication
Journal of Data and Information Quality (JDIQ) - Special Issue on Entity Resolution
Cost-aware query planning for similarity search
Information Systems
Knowledge harvesting in the big-data era
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Determining the relative accuracy of attributes
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
MFIBlocks: An effective blocking algorithm for entity resolution
Information Systems
A taxonomy of privacy-preserving record linkage techniques
Information Systems
Letting keys and functional dependencies out of the bag
APCCM '13 Proceedings of the Ninth Asia-Pacific Conference on Conceptual Modelling - Volume 143
Discovering linkage points over web data
Proceedings of the VLDB Endowment
Publishing bibliographic data on the Semantic Web using BibBase
Semantic Web - Linked Data for science and education
With the ever-increasing volume of data, data quality problems abound. Among the most intriguing are duplicates: multiple, yet differing, representations of the same real-world object. The effects of such duplicates are detrimental; for instance, bank customers can obtain duplicate identities, inventory levels are monitored incorrectly, and catalogs are mailed multiple times to the same household. Automatically detecting duplicates is difficult for two reasons. First, duplicate representations are usually not identical but differ slightly in their values. Second, in principle all pairs of records must be compared, which is infeasible for large volumes of data. This lecture closely examines the two main components that address these difficulties: (i) similarity measures, used to automatically identify duplicates when comparing two records; well-chosen similarity measures improve the effectiveness of duplicate detection. (ii) Algorithms designed to search for duplicates in very large volumes of data; well-designed algorithms improve the efficiency of duplicate detection. Finally, we discuss methods to evaluate the success of duplicate detection.

Table of Contents: Data Cleansing: Introduction and Motivation / Problem Definition / Similarity Functions / Duplicate Detection Algorithms / Evaluating Detection Success / Conclusion and Outlook / Bibliography
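To make the two components concrete, here is a minimal sketch, not taken from the lecture itself: it pairs one possible similarity measure (token-based Jaccard similarity, one of many choices) with the naive all-pairs comparison algorithm. The records, the threshold of 0.5, and the function names are illustrative assumptions; the quadratic cost of the inner loop is exactly what motivates the more efficient algorithms (e.g., blocking or windowing) the lecture covers.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated, lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def find_duplicates(records, threshold=0.5):
    """Naive all-pairs comparison: O(n^2) record comparisons, which is
    infeasible at scale -- hence blocking/windowing algorithms."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Hypothetical records: the first two represent the same person.
records = [
    "John Smith 123 Main St Springfield",
    "Jon Smith 123 Main Street Springfield",
    "Alice Jones 9 Elm Rd Shelbyville",
]
print(find_duplicates(records))  # -> [(0, 1)]
```

Swapping in a different similarity function (edit distance, q-grams, etc.) or a smarter candidate-generation strategy changes effectiveness and efficiency independently, which is why the lecture treats the two components separately.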