Matching XML documents in highly dynamic applications

Authors:
Adrovane M. Kade; Carlos A. Heuser
Affiliations:
Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil; Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
Venue:
Proceedings of the eighth ACM symposium on Document engineering
Year:
2008

Abstract

Highly dynamic applications such as the Web and peer-to-peer systems demand considerable effort in document management. Documents from different sources may contain parts that differ in structure or content yet represent the same conceptual information. An essential task in this scenario is identifying complementary or overlapping documents that need to be integrated. In this paper, we deal specifically with documents represented in XML. XML document integration is an important process in highly dynamic applications because the volume of data available in this format is constantly growing. It is also a challenging task: the flexible nature of XML may lead to structural divergences and content conflicts between documents. We present a novel approach to the matching problem, i.e., the problem of determining which parts of two documents contain the same information. Matching is usually the first step of an integration process. Our approach is novel in that it combines similarity information from the content of the elements with information from the structure of the documents. As our experiments confirm, this combination makes the approach capable of handling both content and structural divergences.
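
To make the idea of combining content and structure similarity concrete, the following Python sketch may help. It is an illustration only and is not the algorithm described in the paper: the leaf-level matching, the use of `difflib.SequenceMatcher` for both measures, the function names, the `weight` and `threshold` parameters, and the two sample documents are all assumptions introduced here.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def leaves(root):
    """Collect (path, text) pairs for every leaf element that carries text."""
    found = []

    def walk(node, path):
        path = path + [node.tag]
        children = list(node)
        if not children and node.text and node.text.strip():
            found.append(("/".join(path), node.text.strip()))
        for child in children:
            walk(child, path)

    walk(root, [])
    return found


def content_sim(text_a, text_b):
    """Character-level similarity of two text values, in [0, 1]."""
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()


def structure_sim(path_a, path_b):
    """Similarity of two root-to-leaf tag paths, compared tag by tag."""
    return SequenceMatcher(None, path_a.split("/"), path_b.split("/")).ratio()


def match(xml_a, xml_b, weight=0.7, threshold=0.6):
    """For each leaf of document A, keep its best-scoring leaf of document B.

    The score is a weighted sum of content and structure similarity;
    `weight` and `threshold` are arbitrary illustrative values.
    """
    leaves_a = leaves(ET.fromstring(xml_a))
    leaves_b = leaves(ET.fromstring(xml_b))
    pairs = []
    for path_a, text_a in leaves_a:
        best = None
        for path_b, text_b in leaves_b:
            score = (weight * content_sim(text_a, text_b)
                     + (1 - weight) * structure_sim(path_a, path_b))
            if best is None or score > best[2]:
                best = (path_a, path_b, score)
        if best is not None and best[2] >= threshold:
            pairs.append(best)
    return pairs


# Hypothetical documents that diverge in both structure and content.
DOC_A = ("<library><book><title>Data on the Web</title>"
         "<author>Serge Abiteboul</author></book></library>")
DOC_B = ("<library><item><book><title>Data on the Web</title></book>"
         "<writer>S. Abiteboul</writer></item></library>")

for path_a, path_b, score in match(DOC_A, DOC_B):
    print(f"{path_a} <-> {path_b} (score {score:.2f})")
```

In this sketch the `author` and `writer` elements are paired mainly because their text is similar, while the two `title` elements benefit from both identical text and a largely shared path. A real matcher would also need to handle inner elements, attributes, and one-to-many correspondences, which this sketch ignores.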