Hierarchical clustering of XML documents focused on structural components

Authors:
Gianni Costa; Giuseppe Manco; Riccardo Ortale; Ettore Ritacco
Venue:
Data & Knowledge Engineering
Year:
2013

Abstract

Clustering XML documents by structure is the task of grouping them by common structural components. So far, this has been accomplished by looking at the occurrence of a single, pre-established type of structural component in the structures of the XML documents. However, structural components chosen a priori may not be the most appropriate for effective clustering. Moreover, the resulting clusters are likely to exhibit some degree of inner structural inhomogeneity, because differences in the structures of the XML documents that stem from other, neglected forms of structural components go undetected. To overcome these limitations, a new hierarchical approach is proposed that allows multiple forms of structural components to be considered, if necessary, in order to isolate structurally homogeneous clusters of XML documents. At each level of the resulting hierarchy, clusters are divided by considering a type of structural component, unaddressed at the preceding levels, that still differentiates the structures of the XML documents. Each cluster in the hierarchy is summarized through a novel technique that provides a clear and differentiated understanding of its structural properties. A comparative evaluation over both real and synthetic XML data shows that the devised approach outperforms established competitors in effectiveness and scalability. Cluster summarization is also shown to be highly representative.
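
The level-wise idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the two component types (tag sets, then root-to-node paths), the exact-match grouping rule, and all function names are assumptions chosen only to show how a cluster that looks homogeneous under one type of structural component may still be split by a type not yet used at the levels above it.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def tag_set(root):
    """Hypothetical component type 1: the set of element tags in a document."""
    return frozenset(el.tag for el in root.iter())


def path_set(root):
    """Hypothetical component type 2: the set of root-to-node tag paths."""
    paths = set()

    def walk(el, prefix):
        path = prefix + "/" + el.tag
        paths.add(path)
        for child in el:
            walk(child, path)

    walk(root, "")
    return frozenset(paths)


def split(docs, feature):
    """Toy grouping rule (an assumption): documents fall in the same group
    only if they have exactly the same components of the given type."""
    groups = defaultdict(list)
    for name, root in docs:
        groups[feature(root)].append((name, root))
    return list(groups.values())


def hierarchical_split(docs, features):
    """Build a cluster tree. Each cluster is divided by the first component
    type, among those not used at the levels above, that still differentiates
    its documents; its children may only use the remaining types."""
    node = {"members": sorted(name for name, _ in docs), "children": []}
    for i, feature in enumerate(features):
        parts = split(docs, feature)
        if len(parts) > 1:
            node["children"] = [
                hierarchical_split(part, features[i + 1:]) for part in parts
            ]
            break
    return node


# Three made-up documents: d1 and d2 share the same tags but nest them differently.
docs = [
    ("d1", ET.fromstring("<a><b/><c/></a>")),
    ("d2", ET.fromstring("<a><b><c/></b></a>")),
    ("d3", ET.fromstring("<a><d/></a>")),
]
tree = hierarchical_split(docs, [tag_set, path_set])
```

In this example the first level separates d3 from d1 and d2 by tag set, and the second level separates d1 from d2 by paths, a difference that tags alone leave undetected. A realistic method would replace the exact-match rule with a proper clustering step over component occurrences.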