A methodology for clustering XML documents by structure

Authors:
Theodore Dalamagas;Tao Cheng;Klaas-Jan Winkel;Timos Sellis
Affiliations:
School of Electrical and Computer Engineering, National Technical University of Athens, Zographou, Athens, Greece;Department of Computer Science, University of California, Santa Barbara, CA;Faculty of Computer Science, University of Twente, AE Enschede, The Netherlands;School of Electrical and Computer Engineering, National Technical University of Athens, Zographou, Athens, Greece
Venue:
Information Systems
Year:
2006

Abstract

The processing and management of XML data are popular research topics. However, operations based on the structure of XML data have received comparatively little attention. One such operation is the grouping of structurally similar XML documents, which is obtained by applying clustering methods with distances that estimate the similarity between tree structures. This paper presents a framework for clustering XML documents by structure. Modeling XML documents as rooted, ordered, labeled trees, we study the use of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We propose using structural summaries of the trees to speed up the distance calculation while maintaining, or even improving, its quality. Our approach is evaluated on a prototype testbed.
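
The general idea of clustering documents by structure can be illustrated with a minimal Python sketch. It is not the paper's method: as a simplifying assumption, each document is reduced to its set of root-to-element label paths (a crude stand-in for a structural summary, since repeated siblings collapse into one path), and a Jaccard distance on those sets replaces the tree-based structural distance. The function names `label_paths`, `structural_distance`, and `single_link_clusters`, and the `threshold` parameter, are hypothetical.

```python
import xml.etree.ElementTree as ET


def label_paths(xml_text):
    """Return the set of root-to-element label paths of an XML document.

    Assumption: this path set is only a rough stand-in for a structural
    summary; repeated sibling elements collapse into a single path.
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def structural_distance(a, b):
    """Jaccard distance between two path sets (0.0 means identical sets).

    Assumption: used here instead of a tree-based structural distance.
    """
    return 1.0 - len(a & b) / len(a | b)


def single_link_clusters(docs, threshold):
    """Agglomerative single-link clustering, stopped at a distance threshold."""
    summaries = [label_paths(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    structural_distance(summaries[p], summaries[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


docs = [
    "<book><title/><author/><author/></book>",
    "<book><title/><author/></book>",
    "<order><item><price/></item></order>",
]
clusters = single_link_clusters(docs, threshold=0.5)
```

In this toy input the two `book` documents share the same path set and are merged, while the `order` document stays in its own cluster. The pairwise search is quadratic in the number of clusters per merge step, which is acceptable for a sketch but not for large collections.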