On the use of hierarchical information in sequential mining-based XML document similarity computation

Authors:
Ho-Pong Leung;Fu-Lai Chung;Stephen Chi-Fai Chan
Affiliations:
Hong Kong Polytechnic University, Department of Computing, Hunghom, Kowloon, Hong Kong;Hong Kong Polytechnic University, Department of Computing, Hunghom, Kowloon, Hong Kong;Hong Kong Polytechnic University, Department of Computing, Hunghom, Kowloon, Hong Kong
Venue:
Knowledge and Information Systems
Year:
2005

Abstract

Measuring the structural similarity among XML documents is the task of finding their semantic correspondence, and it is fundamental to many web-based applications. Several methods address the problem; among them, the data mining approach appears novel and promising. It extracts paths from XML documents, encodes them as sequences, and finds the maximal frequent sequences using sequential pattern mining algorithms. Because ignoring hierarchical information when encoding the paths for mining leads to deficiencies, this paper proposes a new sequential pattern mining scheme for XML document similarity computation. The scheme uses a preorder tree representation (PTR) to encode the XML tree’s paths so that both the semantics of the elements and the hierarchical structure of the document are taken into account when computing the structural similarity among documents. In addition, it proposes a postprocessing step that reuses the mined patterns to estimate the similarity of unmatched elements, so that another metric to qualify the similarity between XML documents can be introduced. Encouraging experimental results are reported.
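
The idea of encoding XML paths together with their hierarchical position can be illustrated with a minimal sketch. The abstract does not specify the PTR encoding, so the code below makes assumptions: each element on a root-to-leaf path is paired with its depth, and paths are collected in preorder. The function names `preorder_paths` and `common_path_ratio` are hypothetical, and the Jaccard ratio over whole paths is only a simple stand-in for the maximal frequent sequence mining described in the paper, not the authors' method.

```python
import xml.etree.ElementTree as ET


def preorder_paths(xml_text):
    """Return root-to-leaf paths as lists of (tag, depth) pairs, in preorder.

    Assumption: pairing each tag with its depth is one simple way to keep
    hierarchical information in the sequence; the paper's PTR may differ.
    """
    root = ET.fromstring(xml_text)
    paths = []

    def visit(node, prefix, depth):
        current = prefix + [(node.tag, depth)]
        children = list(node)
        if not children:
            # Leaf element: the accumulated prefix is a complete path.
            paths.append(current)
        for child in children:
            visit(child, current, depth + 1)

    visit(root, [], 0)
    return paths


def common_path_ratio(doc_a, doc_b):
    """Toy similarity: shared encoded paths over all distinct paths (Jaccard).

    Stand-in only; it does not perform sequential pattern mining.
    """
    a = {tuple(p) for p in preorder_paths(doc_a)}
    b = {tuple(p) for p in preorder_paths(doc_b)}
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    doc_a = "<book><title/><author><name/></author></book>"
    doc_b = "<book><title/><name/></book>"
    for path in preorder_paths(doc_a):
        print(path)
    print(common_path_ratio(doc_a, doc_b))
```

In this sketch the element `name` sits at depth 2 in the first document and depth 1 in the second, so the two paths ending in `name` are treated as different. An encoding that kept only tag names would still record `name` as occurring in both documents without noting the level at which it appears, which is the kind of hierarchical information the abstract argues should not be ignored.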