The processing and management of XML data are popular research issues. However, operations based on the structure of XML data have received comparatively little attention. One such operation is the grouping of structurally similar XML documents, which is obtained by applying clustering methods with distances that estimate the similarity between tree structures. This paper presents a framework for clustering XML documents by structure. Modeling XML documents as rooted, ordered, labeled trees, we study the use of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We propose structural summaries of trees to speed up the distance calculation while maintaining, or even improving, its quality. We evaluate our approach using a prototype testbed.
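To make the pipeline concrete, the following is a minimal sketch of clustering XML documents by structure. It is not the paper's method. As an assumption, the set of distinct root-to-node tag paths stands in for a structural summary, and a Jaccard distance over those sets stands in for a tree-structure distance. The function names `label_paths`, `structural_distance`, and `single_link_clusters`, as well as the stopping threshold, are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def label_paths(xml_text):
    """Return the set of distinct root-to-node tag paths of an XML document.

    Assumption: this set is only a crude stand-in for a structural summary.
    Repeated sibling elements collapse into a single path.
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + (child.tag,)))
    return paths


def structural_distance(a, b):
    """Jaccard distance between two path sets (0.0 = identical, 1.0 = disjoint).

    Assumption: a simple substitute for a tree-structure distance.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_link_clusters(docs, threshold):
    """Single-link agglomerative clustering over document indices.

    Merging stops once the two closest clusters are farther apart
    than `threshold`.
    """
    summaries = [label_paths(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def link(c1, c2):
        # Single link: distance between the closest pair of members.
        return min(
            structural_distance(summaries[i], summaries[j])
            for i in c1
            for j in c2
        )

    while len(clusters) > 1:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: link(clusters[p[0]], clusters[p[1]]),
        )
        if link(clusters[x], clusters[y]) > threshold:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    docs = [
        "<book><title/><author/><author/></book>",
        "<book><title/><author/></book>",
        "<order><item><price/></item><item><price/></item></order>",
    ]
    print(single_link_clusters(docs, threshold=0.5))
```

In this sketch the two `book` documents get the same path set even though one repeats `author`, which illustrates in a simplified way how a summary can shrink the trees being compared without changing the grouping.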