Measuring structural similarity of semistructured data based on information-theoretic approaches

Authors:
Sven Helmer;Nikolaus Augsten;Michael Böhlen
Affiliations:
Birkbeck, University of London, London, UK WC1E 7HX;Free University of Bozen-Bolzano, Bozen-Bolzano, Italy 39100;University of Zurich, Zurich, Switzerland 8050
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2012

Abstract

We propose and experimentally evaluate several approaches for measuring the structural similarity of semistructured documents based on information-theoretic concepts. All approaches share a two-step procedure: first, we extract and linearize the structural information from the documents; second, we determine the distance between the documents using similarity measures based on either Kolmogorov complexity or Shannon entropy. Unlike other approaches, ours achieve linear run-time complexity, and our experimental evaluation shows that their clustering quality is on a par with, or even better than, that of other, slower approaches.
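
The two-step procedure can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes XML input, linearizes the structure as a pre-order sequence of opening and closing tag names, and uses the normalized compression distance with zlib as a computable stand-in for a measure based on Kolmogorov complexity. The function names (linearize, compressed_size, ncd, structural_distance) and the linearization scheme are assumptions made for illustration.

```python
import zlib
import xml.etree.ElementTree as ET


def linearize(xml_text: str) -> bytes:
    """Step 1 (assumed scheme): keep only the tag structure, in pre-order.

    Text content and attributes are dropped, so two documents that differ
    only in their data values produce the same byte string.
    """
    root = ET.fromstring(xml_text)
    tokens = []

    def walk(node):
        tokens.append("<" + node.tag + ">")
        for child in node:
            walk(child)
        tokens.append("</" + node.tag + ">")

    walk(root)
    return "".join(tokens).encode("utf-8")


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed input, a rough proxy for its complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Step 2: normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def structural_distance(doc_a: str, doc_b: str) -> float:
    """Distance between the linearized structures of two XML documents."""
    return ncd(linearize(doc_a), linearize(doc_b))
```

Calling structural_distance on two documents with the same element layout but different text should give a smaller value than calling it on two documents with unrelated layouts. A real compressor only approximates Kolmogorov complexity, so the distance of a document to itself is small but generally not zero.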