Comparing tree-structured data for structural similarity is a recurring problem on which much effort has been spent. Most approaches so far are grounded, implicitly or explicitly, in algorithmic information theory: they approximate an information distance derived from Kolmogorov complexity. In this paper we propose a novel complexity metric, also grounded in information theory, but calculated via Shannon's entropy equations. We use it to formulate a bounded, directly and efficiently computable metric for the structural difference between unordered trees. The paper explains the derivation of the metric in terms of information theory and proves the essential property that it is a distance metric. Because the metric is bounded, it can be used in contexts such as clustering, where second-order comparisons are required. Because it is a distance metric, it can be used for similarity search and in metric spaces in general, allowing trees to be indexed and stored within this domain. We are not aware of any other tree similarity metric with these properties.
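The abstract does not state the metric's formula, so the following is only an illustrative sketch of the general idea and not the paper's own definition. It assumes that each unordered tree is summarised as a multiset of root-to-node label paths, and it substitutes a well-known entropy-based measure, the Jensen-Shannon distance (the square root of the Jensen-Shannon divergence, base 2), which is bounded in [0, 1] and satisfies the triangle inequality. The tree encoding and the names `path_counts`, `entropy` and `js_distance` are hypothetical choices made for this example.

```python
import math
from collections import Counter

# Assumed encoding: a tree is (label, [children]).
# Child order is ignored because only the multiset of paths is used.

def path_counts(tree, prefix=()):
    """Count every root-to-node label path in the tree."""
    label, children = tree
    path = prefix + (label,)
    counts = Counter([path])
    for child in children:
        counts.update(path_counts(child, path))
    return counts

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def js_distance(t1, t2):
    """Jensen-Shannon distance between the path distributions of two trees.

    This is a stand-in for the paper's metric, not a reproduction of it.
    """
    c1, c2 = path_counts(t1), path_counts(t2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    keys = set(c1) | set(c2)
    p = [c1[k] / n1 for k in keys]
    q = [c2[k] / n2 for k in keys]
    m = [(a + b) / 2 for a, b in zip(p, q)]
    jsd = entropy(m) - (entropy(p) + entropy(q)) / 2
    return math.sqrt(max(jsd, 0.0))  # clamp tiny negative rounding errors

# Example trees: `a` and `b` differ only in child order.
a = ("root", [("x", []), ("y", [("z", [])])])
b = ("root", [("y", [("z", [])]), ("x", [])])
c = ("root", [("x", []), ("w", [])])
print(js_distance(a, b), js_distance(a, c))
```

Note that under this simplification two different trees with the same path distribution are at distance zero, so the sketch gives a pseudometric on trees; the paper claims a true distance metric for its own construction.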