An Efficient and Scalable Algorithm for Clustering XML Documents by Structure

Authors:
Wang Lian;David Wai-lok Cheung;Nikos Mamoulis;Siu-Ming Yiu
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2004

Abstract

With the standardization of XML as an information exchange language over the net, a huge amount of information is formatted in XML documents. To analyze this information efficiently, a popular practice is to decompose the XML documents and store them in relational tables. However, query processing then becomes expensive because, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, documents that conform to different DTDs), mining clusters in the documents can alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. We introduce the notion of a structure graph (s-graph), which supports a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields a clustering algorithm that is both efficient and effective compared with other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection.
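
The abstract does not define the s-graph or the distance metric, so the following is only a minimal sketch under stated assumptions. It assumes that the s-graph of a document is the set of distinct parent-child element-name edges, that the s-graph of a set of documents is the union of their edge sets, and that the distance between two s-graphs is one minus the number of shared edges divided by the edge count of the larger graph. The function names `s_graph`, `s_graph_of_set`, and `distance` are hypothetical and chosen for illustration; none of this is taken from the S-GRACE implementation.

```python
import xml.etree.ElementTree as ET


def s_graph(xml_text):
    """Assumed s-graph: the set of distinct (parent_tag, child_tag) edges."""
    root = ET.fromstring(xml_text)
    edges = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            # Repeated siblings with the same tag collapse into one edge.
            edges.add((node.tag, child.tag))
            stack.append(child)
    return edges


def s_graph_of_set(xml_texts):
    """Assumed s-graph of a set of documents: the union of their edge sets."""
    edges = set()
    for text in xml_texts:
        edges |= s_graph(text)
    return edges


def distance(graph_a, graph_b):
    """Assumed metric: 1 - |shared edges| / max(|edges_a|, |edges_b|)."""
    larger = max(len(graph_a), len(graph_b))
    if larger == 0:
        return 0.0  # two edgeless graphs are treated as identical
    return 1.0 - len(graph_a & graph_b) / larger


if __name__ == "__main__":
    doc_a = "<book><title/><author><name/></author></book>"
    doc_b = ("<book><title/><author><name/></author>"
             "<author><name/></author></book>")
    doc_c = "<article><title/><journal/></article>"

    print(distance(s_graph(doc_a), s_graph(doc_b)))  # 0.0, same edge set
    print(distance(s_graph(doc_a), s_graph(doc_c)))  # 1.0, no shared edges
```

Under these assumptions the comparison reduces to set operations on edge sets, which is far cheaper than computing a tree-edit distance between two document trees. The trade-off is that sibling order and repetition are ignored, so `doc_a` and `doc_b` above are treated as structurally identical.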