XML Document Clustering Using Common XPath

Authors:
Ho-pong Leung; Fu-lai Chung; Stephen C. F. Chan; Robert Luk
Affiliations:
Department of Computing, Hong Kong Polytechnic University, Hunghom, Hong Kong, China; Department of Computing, Hong Kong Polytechnic University, Hunghom, Hong Kong, China; Department of Computing, Hong Kong Polytechnic University, Hunghom, Hong Kong, China; Department of Computing, Hong Kong Polytechnic University, Hunghom, Hong Kong, China
Venue:
WIRI '05 Proceedings of the International Workshop on Challenges in Web Information Retrieval and Integration
Year:
2005

Abstract

XML is becoming a common format for storing data. The elements of a document and their arrangement in its hierarchy not only describe the document's structure but also imply the semantic meaning of the data, and hence provide valuable information for developing tools that manipulate XML documents. In this paper, we pursue a data mining approach to the problem of XML document clustering. We introduce a novel XML structural representation called common XPath (CXP), which encodes frequently occurring elements together with their hierarchical information, and we propose using the mined CXPs as the feature vectors for XML document clustering. In other words, data mining acts as a feature extractor in the clustering process. Based on this idea, we devise a path-based XML document clustering algorithm called PBClustering, which groups documents according to their CXPs, i.e., their frequent structures. We report encouraging simulation results.
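
The idea of using mined frequent paths as clustering features can be illustrated with a minimal sketch. The abstract does not give the exact definition of a CXP or the steps of PBClustering, so everything below is an assumption made for illustration: a root-to-element tag path stands in for a CXP, `min_support` is a hypothetical document-frequency threshold, and the grouping step is a simple greedy Jaccard-threshold clustering swapped in for PBClustering, not the authors' algorithm.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-element tag paths in one document."""
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def common_paths(docs_paths, min_support):
    """Keep paths found in at least min_support documents (assumed stand-in for CXPs)."""
    counts = Counter(p for paths in docs_paths for p in paths)
    return sorted(p for p, c in counts.items() if c >= min_support)


def feature_vector(paths, features):
    """Binary vector: 1 if the document contains the feature path, else 0."""
    return [1 if f in paths else 0 for f in features]


def jaccard(a, b):
    """Jaccard similarity between two binary vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def greedy_cluster(vectors, threshold):
    """Put each vector in the first cluster whose first member is similar enough.

    This is a placeholder grouping step, not PBClustering.
    """
    clusters = []
    for i, v in enumerate(vectors):
        for cluster in clusters:
            if jaccard(vectors[cluster[0]], v) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Four toy documents: two book-like and two order-like structures.
docs = [
    "<book><title/><author><name/></author></book>",
    "<book><title/><author><name/></author><year/></book>",
    "<order><item><price/></item></order>",
    "<order><item><price/><qty/></item></order>",
]
docs_paths = [tag_paths(d) for d in docs]
features = common_paths(docs_paths, min_support=2)
vectors = [feature_vector(p, features) for p in docs_paths]
clusters = greedy_cluster(vectors, threshold=0.5)
print(features)
print(clusters)
```

In this toy run, paths that occur in only one document (such as `/book/year`) fall below the support threshold and are dropped from the feature set, so the two book-like documents end up with identical vectors, as do the two order-like ones.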