Fuzzy duplicate detection aims to identify multiple representations of the same real-world object stored in a data source, and is a task of critical practical relevance in data cleaning, data mining, and data integration. It has a long history for relational data stored in a single table (or in multiple tables with the same schema). Algorithms for fuzzy duplicate detection in more complex structures, e.g., the hierarchies of a data warehouse, XML data, or graph data, have emerged only recently. These algorithms use similarity measures that consider the duplicate status of an object's direct neighbors, e.g., children in hierarchical data, to improve duplicate detection effectiveness. In this paper, we propose a novel method for fuzzy duplicate detection in hierarchical and semi-structured XML data. Unlike previous approaches, it considers not only the duplicate status of children but also the probability that descendants are duplicates. Probabilities are computed efficiently using a Bayesian network. Experiments show that the proposed algorithm maintains high precision and recall, even on data containing many errors and much missing information. Our proposal also outperforms a state-of-the-art duplicate detection system on three different XML databases.
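
The idea of letting descendant probabilities flow upward can be illustrated with a minimal sketch. It is not the Bayesian network construction from the paper: the abstract gives neither the network structure nor the conditional probabilities, so everything below is an assumption made for illustration. In particular, the function names `value_prob` and `duplicate_prob`, the use of a string-similarity ratio as a stand-in probability, the product rule for combining child groups, the best-match average within a group, and the `missing_prior` parameter are all hypothetical choices.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET


def value_prob(a, b):
    """Assumed stand-in: treat a string-similarity ratio in [0, 1] as a probability."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def duplicate_prob(x, y, missing_prior=0.5):
    """Recursively estimate the probability that elements x and y are duplicates.

    Leaves are compared by text; inner nodes combine the probabilities of
    their descendants. The combination rules are illustrative assumptions.
    """
    kids_x, kids_y = list(x), list(y)

    # Leaf case: compare text values, falling back to a prior when one is missing.
    if not kids_x and not kids_y:
        tx, ty = (x.text or "").strip(), (y.text or "").strip()
        if not tx or not ty:
            return missing_prior
        return value_prob(tx, ty)

    # Inner node: group children by tag and combine the groups with a product.
    prob = 1.0
    for tag in sorted({c.tag for c in kids_x} | {c.tag for c in kids_y}):
        xs = [c for c in kids_x if c.tag == tag]
        ys = [c for c in kids_y if c.tag == tag]
        if not xs or not ys:
            # Missing information on one side: use the prior instead of zero.
            prob *= missing_prior
            continue
        # For each child of x, take its best match in y, then average.
        best = [max(duplicate_prob(cx, cy, missing_prior) for cy in ys) for cx in xs]
        prob *= sum(best) / len(best)
    return prob


if __name__ == "__main__":
    a = ET.fromstring(
        "<movie><title>Alien</title><cast><actor>Sigourney Weaver</actor></cast></movie>"
    )
    b = ET.fromstring(
        "<movie><title>Alien</title><cast><actor>Sigorney Weaver</actor></cast></movie>"
    )
    print(duplicate_prob(a, b))
```

Because the score of a node depends on the scores of everything beneath it, a typo deep in the tree lowers the estimate gradually rather than flipping a yes/no decision, and a missing subtree contributes the prior instead of ruling the pair out.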