Duplicate detection is the problem of detecting different entries in a data source that represent the same real-world entity. While research abounds on duplicate detection in relational data, there is as yet little work on duplicates in other, more complex data models, such as XML. In this paper, we present a generalized framework for duplicate detection that divides the problem into three components: candidate definition, which defines which objects are to be compared; duplicate definition, which defines when two duplicate candidates are in fact duplicates; and duplicate detection, which specifies how to efficiently find those duplicates.

Using this framework, we propose an XML duplicate detection method, DogmatiX, which compares XML elements based not only on their direct data values, but also on the similarity of their parents, children, structure, etc. We propose heuristics to determine which of these to choose, as well as a similarity measure specifically geared towards the XML data model. An evaluation of our algorithm using several heuristics validates our approach.
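To make the three-component split concrete, the following is a minimal sketch in Python. It is not the DogmatiX algorithm: the sample data, the function names, the token-based Jaccard similarity, and the 0.7 threshold are all assumptions chosen for illustration, and only the direct children of each candidate are used to describe it.

```python
import re
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical sample data, not taken from the paper.
SAMPLE = """
<catalog>
  <movie><title>The Matrix</title><year>1999</year></movie>
  <movie><title>Matrix, The</title><year>1999</year></movie>
  <movie><title>Signs</title><year>2002</year></movie>
</catalog>
"""


def define_candidates(root, tag):
    """Candidate definition: every element with the given tag is compared."""
    return list(root.iter(tag))


def describe(elem):
    """Describe a candidate by the lowercase word tokens of its direct
    children, grouped by child tag (an assumed, simplified description)."""
    desc = {}
    for child in elem:
        tokens = set(re.findall(r"\w+", (child.text or "").lower()))
        desc.setdefault(child.tag, set()).update(tokens)
    return desc


def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(e1, e2):
    """Duplicate definition, part 1: average per-tag Jaccard similarity
    over the union of child tags (an assumed stand-in measure)."""
    d1, d2 = describe(e1), describe(e2)
    tags = set(d1) | set(d2)
    if not tags:
        return 0.0
    return sum(jaccard(d1.get(t, set()), d2.get(t, set())) for t in tags) / len(tags)


def detect(root, tag, threshold=0.7):
    """Duplicate detection: naive pairwise comparison of all candidates.
    Duplicate definition, part 2: a pair is a duplicate when its
    similarity reaches the (assumed) threshold."""
    cands = define_candidates(root, tag)
    return [
        (i, j)
        for (i, e1), (j, e2) in combinations(enumerate(cands), 2)
        if similarity(e1, e2) >= threshold
    ]


duplicates = detect(ET.fromstring(SAMPLE), "movie")
print(duplicates)  # [(0, 1)]
```

The sketch compares every pair of candidates, so it does not address the efficiency aspect of the detection component, and it ignores parents and structure entirely; it only shows how the three components can be kept separate.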