Mining frequent association tag sequences for clustering XML documents

Authors:
Lijun Zhang;Zhanhuai Li;Qun Chen;Xia Li;Ning Li;Ying Lou
Affiliations:
School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China;School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China;School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China;School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China;School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China;School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China
Venue:
APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
Year:
2012

Abstract

Many XML document clustering algorithms need to compute the similarity between documents. Because XML documents are semi-structured, exploiting structural information to compute structural similarity is a crucial issue in XML similarity computation. Some path-based approaches model the structure of a document as a set of paths and use that set to compute structural similarity. One defect of these approaches is that they ignore the relationships between paths. In this paper, we propose the concept of Frequent Association Tag Sequences (FATS). Based on this concept, we devise an algorithm named FATSMiner for mining FATS, model the structure of XML documents as a set of FATS, and introduce a method for computing structural similarity using FATS. Because FATS capture the ancestor-descendant and sibling relationships among elements, this approach can better represent the structure of XML documents. Our experimental results on real datasets show that this approach is more effective than other path-based approaches.
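
To make the path-based baseline mentioned in the abstract concrete, the following is a minimal sketch of modelling a document as a set of root-to-leaf tag paths and comparing two such sets. It does not implement FATS or FATSMiner, whose definitions are not given here. The function names tag_paths and jaccard, the sample documents, and the choice of Jaccard overlap as the similarity measure are all assumptions made for illustration; the paper may use a different measure.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-leaf tag paths of an XML document.

    Hypothetical helper: one possible reading of "path set".
    """
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        children = list(node)
        if not children:
            # A leaf element closes one root-to-leaf path.
            paths.add(path)
        for child in children:
            walk(child, path)

    walk(root, "")
    return paths


def jaccard(a, b):
    """Jaccard overlap of two sets (an assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# In doc_x, <b/> and <c/> sit under different <a> elements;
# in doc_y, they are siblings under the same <a>.
doc_x = "<r><a><b/></a><a><c/></a></r>"
doc_y = "<r><a><b/><c/></a></r>"

paths_x = tag_paths(doc_x)
paths_y = tag_paths(doc_y)
print(sorted(paths_x))
print(sorted(paths_y))
print(jaccard(paths_x, paths_y))
```

Both documents yield the same two paths, so this sketch reports them as identical even though the sibling relationship between b and c differs. That is one way to picture the defect the abstract attributes to path-based approaches, namely that each path is treated in isolation.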