XML data clustering: An overview

Authors:
Alsayed Algergawy;Marco Mesiti;Richi Nayak;Gunter Saake
Affiliations:
Magdeburg University, Magdeburg, Germany;University of Milano, Milano, Italy;Queensland University of Technology, Brisbane, Australia;Magdeburg University, Magdeburg, Germany
Venue:
ACM Computing Surveys (CSUR)
Year:
2011

Abstract

In the last few years we have observed a proliferation of approaches for clustering XML documents and schemas based on their structure and content. The large number of approaches is due to the different applications that require the clustering of XML data. These applications need data grouped by similar contents, tags, paths, structures, and semantics. In this article, we first outline the application contexts in which clustering is useful, then we survey the approaches proposed so far in terms of the abstract representation of data (instances or schemas), the identified similarity measure, and the clustering algorithm. In this presentation, we aim to draw a taxonomy in which the current approaches can be classified and compared. We aim to introduce an integrated view that is useful when comparing XML data clustering approaches, when developing a new clustering algorithm, and when implementing an XML clustering component. Finally, the article describes future trends and research issues that remain open.
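
The three components named in the abstract (a data representation, a similarity measure, and a clustering algorithm) can be illustrated with a minimal sketch. It is not a method taken from the article: it assumes, purely for illustration, that each document is represented as the set of its root-to-node tag paths, that similarity is the Jaccard coefficient of those sets, and that clustering is single-link agglomerative merging with a fixed threshold. The function names `tag_paths`, `jaccard`, and `cluster`, and the sample documents, are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Data representation (assumed): the set of root-to-node tag paths."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def jaccard(a, b):
    """Similarity measure (assumed): Jaccard coefficient of two path sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(docs, threshold=0.5):
    """Clustering algorithm (assumed): single-link agglomerative merging.

    Repeatedly merges the two most similar clusters until the best
    remaining pair falls below the threshold.
    """
    reps = [tag_paths(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def link(c1, c2):
        # Single link: similarity of the closest pair of members.
        return max(jaccard(reps[i], reps[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best, (x, y) = max(
            (link(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical sample documents: two book records and one paper record.
docs = [
    "<book><title>A</title><author>B</author></book>",
    "<book><title>C</title><author>D</author><year>2011</year></book>",
    "<paper><title>E</title><venue>F</venue></paper>",
]
print(cluster(docs))  # [[0, 1], [2]]
```

The two book documents share three of four distinct paths (similarity 0.75) and are merged, while the paper document shares no path with either and stays in its own cluster. Swapping any one of the three functions for another representation, measure, or algorithm leaves the other two unchanged, which is the kind of component-wise comparison the abstract describes.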