Detecting duplicates is a problem with a long tradition in many domains, such as customer relationship management and data warehousing. The problem is twofold: first, define a suitable similarity measure; second, efficiently apply that measure to all pairs of objects. With the advent and spread of the XML data model, it is necessary to find new similarity measures and to develop efficient methods to detect duplicate elements in nested XML data. A classical approach to duplicate detection in flat relational data is the sorted neighborhood method, which draws its efficiency from sliding a window over the sorted relation and comparing only tuples within that window. We extend the algorithm to cover not only a single relation but also nested XML elements. To compare objects, we make use of XML parent and child relationships. For efficiency, we apply the windowing technique in a bottom-up fashion, detecting duplicates at each level of the XML hierarchy. Experiments show a speedup comparable to that of the original method and demonstrate the high effectiveness of our algorithm in detecting XML duplicates.
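The windowing idea behind the sorted neighborhood method can be illustrated with a minimal sketch for flat records. This is not the paper's implementation: the function name `sorted_neighborhood`, the `window` and `threshold` parameters, the sample titles, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration, and the XML-specific bottom-up extension is not shown.

```python
from difflib import SequenceMatcher


def sorted_neighborhood(records, key, window=3, threshold=0.8):
    """Return index pairs of likely duplicates among `records`.

    Hypothetical helper: sorts the records by `key`, then compares each
    record only with the next `window - 1` records in sorted order.
    """
    # Sort record indices by the sorting key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Slide the window: only neighbors in sorted order are compared.
        for j in order[pos + 1 : pos + window]:
            # Assumed similarity measure: difflib's sequence ratio.
            sim = SequenceMatcher(None, records[i], records[j]).ratio()
            if sim >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


# Made-up sample data for illustration only.
titles = ["The Matrix", "Matrix, The", "The Matrx", "Toy Story", "Toy Storry"]
dups = sorted_neighborhood(titles, key=str.lower, window=3, threshold=0.85)
print(sorted(dups))  # [(0, 2), (3, 4)]
```

With n records and window size w, the sketch performs on the order of n·w comparisons after sorting, rather than comparing all n(n-1)/2 pairs. In this example the reordered title "Matrix, The" is not matched with "The Matrix" because the assumed string similarity scores them below the threshold, which shows why the choice of similarity measure matters as much as the windowing.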