Simple fast algorithms for the editing distance between trees and related problems
SIAM Journal on Computing
Change detection in hierarchically structured information
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Meaningful change detection in structured data
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
An introduction to Kolmogorov complexity and its applications (2nd ed.)
An introduction to Kolmogorov complexity and its applications (2nd ed.)
Extracting schema from semistructured data
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
IEEE Transactions on Pattern Analysis and Machine Intelligence
On the Length of Programs for Computing Finite Binary Sequences
Journal of the ACM (JACM)
The Tree-to-Tree Correction Problem
Journal of the ACM (JACM)
XTRACT: a system for extracting document type descriptors from XML documents
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Modern Information Retrieval
The XXL search engine: ranked retrieval of XML data using indexes and ontologies
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
XClust: clustering XML schemas for effective integration
Proceedings of the eleventh international conference on Information and knowledge management
A System for Approximate Tree Matching
IEEE Transactions on Knowledge and Data Engineering
Visual Web Information Extraction with Lixto
Proceedings of the 27th International Conference on Very Large Data Bases
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
An Efficient and Scalable Algorithm for Clustering XML Documents by Structure
IEEE Transactions on Knowledge and Data Engineering
ISICT '03 Proceedings of the 1st international symposium on Information and communication technologies
Information Systems - Special issue on web data integration
Automatic web news extraction using tree edit distance
Proceedings of the 13th international conference on World Wide Web
Fast Detection of XML Structural Similarity
IEEE Transactions on Knowledge and Data Engineering
Approximate matching of hierarchical data using pq-grams
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing)
Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing)
CP/CV: concept similarity mining without frequency information from domain describing taxonomies
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Measuring the structural similarity of semistructured documents using entropy
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Linear pattern matching algorithms
SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973)
Propagation-vectors for trees (PVT): concise yet effective summaries for hierarchical data and trees
Proceedings of the 2008 ACM workshop on Large-Scale distributed systems for information retrieval
Approximate Joins for Data-Centric XML
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
The pq-gram distance between ordered labeled trees
ACM Transactions on Database Systems (TODS)
A methodology for clustering XML documents by structure
Information Systems
Information theoretic text classification using the ziv-merhav method
IbPRIA'05 Proceedings of the Second Iberian conference on Pattern Recognition and Image Analysis - Volume Part II
IEEE Transactions on Information Theory
A universal algorithm for sequential data compression
IEEE Transactions on Information Theory
IEEE Transactions on Information Theory
IEEE Transactions on Information Theory
Efficient processing of containment queries on nested sets
Proceedings of the 16th International Conference on Extending Database Technology
Hi-index | 0.00 |
We propose and experimentally evaluate different approaches for measuring the structural similarity of semistructured documents based on information-theoretic concepts. Common to all approaches is a two-step procedure: first, we extract and linearize the structural information from documents, and then, we use similarity measures that are based on, respectively, Kolmogorov complexity and Shannon entropy to determine the distance between the documents. Compared to other approaches, we are able to achieve a linear run-time complexity and demonstrate in an experimental evaluation that the results of our technique in terms of clustering quality are on a par with or even better than those of other, slower approaches.