A cluster-based approach to XML similarity joins

Authors:
Leonardo A. Ribeiro;Theo Härder;Fernanda S. Pimenta
Affiliations:
University of Kaiserslautern, Germany;University of Kaiserslautern, Germany;UFRGS -- Institute of Informatics, Porto Alegre, Brazil
Venue:
IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
Year:
2009

Abstract

A natural consequence of the widespread adoption of XML as a standard for information representation and exchange is the redundant storage of large numbers of persistent XML documents. Compared to relational data tables, data represented in XML format can be even more sensitive to data quality issues, because structure, in addition to textual information, may vary across XML documents representing the same information entity. Therefore, correlating XML documents that are similar in content and structure is a fundamental operation. In this paper, we present an effective, flexible, and high-performance XML-based similarity join framework. We exploit structural summaries and clustering concepts to produce compact and high-quality XML document representations: our approach outperforms previous work in terms of both performance and accuracy. In this context, we explore different ways to weigh and combine evidence from textual and structural XML representations. Furthermore, we address user interaction, when the similarity framework is configured for a specific domain, and updatability of clustering information, when new documents enter the datasets under consideration. We present a thorough experimental evaluation to validate our techniques in the context of a native XML DBMS.
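The idea of combining textual and structural evidence in a similarity join can be illustrated with a minimal sketch. This is not the framework described in the paper: it assumes, purely for illustration, that structure is summarized as the set of root-to-node tag paths, that text is a set of lowercase word tokens, that both are compared with Jaccard similarity, and that the two scores are combined by a weighted sum. The function names, the `alpha` weight, and the `threshold` parameter are all hypothetical, and the join is a naive all-pairs comparison with no clustering or indexing.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def path_set(root):
    """Collect the set of root-to-node tag paths (an assumed structural summary)."""
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def token_set(root):
    """Collect lowercase word tokens from all text content of the document."""
    tokens = set()
    for text in root.itertext():
        tokens.update(text.lower().split())
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def combined_similarity(xml_a, xml_b, alpha=0.5):
    """Weighted sum of textual and structural similarity (alpha weights the text)."""
    root_a, root_b = ET.fromstring(xml_a), ET.fromstring(xml_b)
    text_sim = jaccard(token_set(root_a), token_set(root_b))
    struct_sim = jaccard(path_set(root_a), path_set(root_b))
    return alpha * text_sim + (1 - alpha) * struct_sim


def similarity_join(docs, threshold, alpha=0.5):
    """Naive all-pairs join: index pairs whose combined score reaches the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(docs), 2)
        if combined_similarity(a, b, alpha) >= threshold
    ]


docs = [
    "<cd><title>Abbey Road</title><artist>Beatles</artist></cd>",
    "<cd><title>Abbey Road</title><band>Beatles</band></cd>",
    "<book><title>Dune</title></book>",
]
print(similarity_join(docs, threshold=0.7))
```

In this toy input the first two documents carry the same text but differ in one element name, so their textual score is high while their structural score is lower; the weight `alpha` decides how much that structural variation counts against the match.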