The merge/purge problem for large databases
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Efficient clustering of high-dimensional data sets with application to reference matching
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Normalized Cuts and Image Segmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence
Database Systems: The Complete Book
Database Systems: The Complete Book
Three companions for data mining in first order logic
Relational Data Mining
Approximate String Joins in a Database (Almost) for Free
Proceedings of the 27th International Conference on Very Large Data Bases
Interactive deduplication using active learning
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning domain-independent string transformation weights for high accuracy object identification
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning to match and cluster large high-dimensional data sets for data integration
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
A Bayesian decision model for cost optimal record matching
The VLDB Journal — The International Journal on Very Large Data Bases
Robust and efficient fuzzy match for online data cleaning
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Adaptive duplicate detection using learnable string similarity measures
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Algorithms for estimating relative importance in networks
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Kernel Methods for Pattern Analysis
Kernel Methods for Pattern Analysis
Cleaning the Spurious Links in Data
IEEE Intelligent Systems
Iterative record linkage for cleaning and integration
Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Fast discovery of connection subgraphs
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Reference reconciliation in complex information spaces
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
A Mathematical Theory of Communication
A Mathematical Theory of Communication
Eliminating fuzzy duplicates in data warehouses
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Domain-independent data cleaning via analysis of entity-relationship graph
ACM Transactions on Database Systems (TODS)
Adaptive graphical approach to entity resolution
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Improving the accuracy of entity identification through refinement
Ph.D. '08 Proceedings of the 2008 EDBT Ph.D. workshop
Consolidation of References to Persons in Bibliographic Databases
ICADL 08 Proceedings of the 11th International Conference on Asian Digital Libraries: Universal and Ubiquitous Access to Information
Exploiting context analysis for combining multiple entity resolution systems
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Frameworks for entity matching: A comparison
Data & Knowledge Engineering
Self-tuning in graph-based reference disambiguation
DASFAA'07 Proceedings of the 12th international conference on Database systems for advanced applications
Multiple relationship based deduplication
Proceedings of the Fourth SIGMOD PhD Workshop on Innovative Database Research
A graphical method for reference reconciliation
DASFAA'10 Proceedings of the 15th international conference on Database systems for advanced applications
Duplicate detection through structure optimization
Proceedings of the 20th ACM international conference on Information and knowledge management
Searching and browsing Linked Data with SWSE: The Semantic Web Search Engine
Web Semantics: Science, Services and Agents on the World Wide Web
Exploiting Web querying for Web people search
ACM Transactions on Database Systems (TODS)
Web Semantics: Science, Services and Agents on the World Wide Web
Linking records in dynamic world
PhD '12 Proceedings of the on SIGMOD/PODS 2012 PhD Symposium
Adaptive Connection Strength Models for Relationship-Based Entity Resolution
Journal of Data and Information Quality (JDIQ) - Special Issue on Entity Resolution
Hi-index | 0.00 |
Data mining practitioners frequently have to spend significant portion of their project time on data preprocessing before they can apply their algorithms on real-world datasets. Such a preprocessing is required because many real-world datasets are not perfect, but rather they contain missing, erroneous, duplicate data and other data cleaning problems. It is a well established fact that, in general, if such problems with data are not corrected, applying data mining algorithm can lead to wrong results. The latter is known as the "garbage in, garbage out" principle. Given the significance of the problem, numerous data cleaning techniques have been designed in the past to address the aforementioned problems with data.In this paper, we address one of the data cleaning challenges, called object consolidation. This important challenge arises because objects in datasets are frequently represented via descriptions (a set of instantiated attributes), which alone might not always uniquely identify the object. The goal of object consolidation is to correctly consolidate (i.e., to group/determine) all the representations of the same object, for each object in the dataset. In contrast to traditional domain-independent data cleaning techniques, our approach analyzes not only object features, but also additional semantic information: inter-objects relationships, for the purpose of object consolidation. The approach views datasets as attributed relational graphs (ARGs) of object representations (nodes), connected via relationships (edges). The approach then applies graph partitioning techniques to accurately cluster object representations. Our empirical study over real datasets shows that analyzing relationships significantly improves the quality of the result.