Highly dynamic applications such as the Web and peer-to-peer systems demand considerable effort in document management. Documents from different sources may contain parts that, despite differing in structure or content, represent the same conceptual information. An essential task in this scenario is identifying complementary or overlapping documents that need to be integrated. In this paper, we deal specifically with documents represented in XML. XML document integration is an important process in highly dynamic applications because the volume of data available in this format is constantly growing. It is also a challenging task: the flexible nature of XML may lead to structural divergences and content conflicts between documents. We present a novel approach to the matching problem, i.e., the problem of determining which parts of two documents contain the same information. Matching is usually the first step of an integration process. Our approach is novel in that it combines similarity information from the content of the elements with information from the structure of the documents. As our experiments confirm, this combination makes the approach capable of handling both content and structural divergences.
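To make the idea of combining content and structural evidence concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes that only leaf elements are compared, that content similarity is a character-level ratio from Python's `difflib`, that structural similarity is the same ratio applied to root-to-leaf tag paths, and that the two are mixed linearly. The names `leaf_paths`, `content_sim`, `structure_sim`, `combined_sim`, and `match`, the weight `alpha`, the `threshold`, and the sample documents are all hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Hypothetical sample documents: same information, different nesting and wording.
DOC_A = "<book><title>Data Integration</title><author>A. Smith</author></book>"
DOC_B = ("<book><info><title>Data integration</title></info>"
         "<author>Smith</author></book>")


def leaf_paths(root):
    """Return (tag_path, text) for every leaf element, in document order."""
    leaves = []

    def walk(node, path):
        path = path + [node.tag]
        children = list(node)
        if not children:
            leaves.append((path, (node.text or "").strip()))
        for child in children:
            walk(child, path)

    walk(root, [])
    return leaves


def content_sim(a, b):
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def structure_sim(p, q):
    """Similarity of two root-to-leaf tag paths in [0, 1]."""
    return SequenceMatcher(None, p, q).ratio()


def combined_sim(x, y, alpha=0.5):
    """Linear mix of content and structure similarity (alpha is an assumption)."""
    return alpha * content_sim(x[1], y[1]) + (1 - alpha) * structure_sim(x[0], y[0])


def match(doc_a, doc_b, alpha=0.5, threshold=0.6):
    """For each leaf of doc_a, keep its best-scoring leaf of doc_b above threshold."""
    leaves_a = leaf_paths(ET.fromstring(doc_a))
    leaves_b = leaf_paths(ET.fromstring(doc_b))
    pairs = []
    for x in leaves_a:
        best = max(leaves_b, key=lambda y: combined_sim(x, y, alpha), default=None)
        if best is not None and combined_sim(x, best, alpha) >= threshold:
            pairs.append(("/".join(x[0]), "/".join(best[0])))
    return pairs


if __name__ == "__main__":
    for left, right in match(DOC_A, DOC_B):
        print(left, "<->", right)
```

In this sketch, `title` in the first document is paired with the more deeply nested `title` in the second because the identical text compensates for the differing path, while the `author` elements are paired because their identical paths compensate for the differing text. A greedy best-match per leaf is the simplest possible assignment policy; a one-to-one assignment or a comparison of inner subtrees would be natural refinements.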