DogmatiX tracks down duplicates in XML

Authors:
Melanie Weis; Felix Naumann
Affiliations:
Humboldt-Universität zu Berlin, Berlin, Germany; Humboldt-Universität zu Berlin, Berlin, Germany
Venue:
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Year:
2005

Abstract

Duplicate detection is the problem of detecting different entries in a data source that represent the same real-world entity. While research abounds on duplicate detection in relational data, there is still little work on duplicates in other, more complex data models, such as XML. In this paper, we present a generalized framework for duplicate detection that divides the problem into three components: candidate definition, which defines which objects are to be compared; duplicate definition, which defines when two duplicate candidates are in fact duplicates; and duplicate detection, which specifies how to efficiently find those duplicates. Using this framework, we propose an XML duplicate detection method, DogmatiX, which compares XML elements based not only on their direct data values, but also on the similarity of their parents, children, structure, etc. We propose heuristics to determine which of these to choose, as well as a similarity measure specifically geared towards the XML data model. An evaluation of our algorithm using several heuristics validates our approach.
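
The three-component framework can be illustrated with a minimal Python sketch. This is not the DogmatiX algorithm or its similarity measure: the sample catalog, the function names (`candidates`, `similarity`, `detect`), the use of `difflib.SequenceMatcher` as the string similarity, the restriction to child text values, and the threshold are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample data, not taken from the paper.
XML = """
<catalog>
  <movie><title>The Matrix</title><year>1999</year></movie>
  <movie><title>Matrix, The</title><year>1999</year></movie>
  <movie><title>Signs</title><year>2002</year></movie>
</catalog>
"""


def candidates(root, tag):
    """Candidate definition: which elements are compared (here, all with a given tag)."""
    return list(root.iter(tag))


def description(elem):
    """Describe an element by the normalized text of its direct children."""
    return {child.tag: (child.text or "").strip().lower() for child in elem}


def similarity(a, b):
    """Duplicate definition (part 1): average string similarity over shared child tags.

    Assumption: a stand-in measure, not the one proposed in the paper.
    """
    da, db = description(a), description(b)
    shared = set(da) & set(db)
    if not shared:
        return 0.0
    total = sum(SequenceMatcher(None, da[t], db[t]).ratio() for t in shared)
    return total / len(shared)


def detect(root, tag, threshold):
    """Duplicate detection: naive all-pairs comparison against a threshold.

    Returns index pairs (i, j) of candidates classified as duplicates.
    """
    cands = candidates(root, tag)
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(cands), 2)
        if similarity(a, b) >= threshold
    ]


if __name__ == "__main__":
    root = ET.fromstring(XML)
    for i, j in detect(root, "movie", threshold=0.7):
        print(f"movie {i} and movie {j} are likely duplicates")
```

The all-pairs loop is quadratic in the number of candidates, which is why the framework treats efficient detection as its own component. A fuller version would also draw on parents and structure, as the abstract describes.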