On the midpoint of a set of XML documents

Authors:
Alberto Abelló;Xavier de Palol;Mohand-Saïd Hacid
Affiliations:
Dept. de Llenguatges i Sistemes Informàtics, U. Politècnica de Catalunya;Dept. de Llenguatges i Sistemes Informàtics, U. Politècnica de Catalunya;LIRIS- UFR d'Informatique, U. Claude Bernard Lyon 1
Venue:
DEXA'05 Proceedings of the 16th international conference on Database and Expert Systems Applications
Year:
2005

Abstract

The WWW contains a huge number of documents. Some of them share a subject but are produced by different people or even different organizations. To guarantee the interchange of such documents, we can use XML, which allows sharing documents that do not have the same structure. However, this makes it difficult to understand the core of such heterogeneous documents (in general, a schema is not available). In this paper, we offer a characterization of, and an algorithm to obtain, the midpoint (in terms of a resemblance function) of a set of semi-structured, heterogeneous documents without optional elements. The trivial case of the midpoint would be the elements common to all documents. Nevertheless, with several heterogeneous documents this may result in an empty set. Thus, we consider that those elements present in a given number of documents belong to the midpoint. An exact schema could always be found by generating optional elements. However, the exact schema of the whole set may result in overspecialization (lots of optional elements), which would make it useless.
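
The threshold idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm and does not use its resemblance function: it assumes, purely for illustration, that each document is reduced to the set of its root-to-element tag paths, and that an element is kept when it appears in at least a given fraction of the documents. The names `element_paths`, `frequent_paths`, and `threshold`, and the sample documents, are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def element_paths(xml_text):
    """Return the set of root-to-element tag paths found in one document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def frequent_paths(documents, threshold):
    """Keep paths present in at least `threshold` (a fraction in [0, 1]) of the documents.

    Assumption: presence is counted once per document, ignoring repetitions.
    """
    counts = Counter()
    for doc in documents:
        counts.update(element_paths(doc))
    minimum = threshold * len(documents)
    return {path for path, n in counts.items() if n >= minimum}


# Hypothetical sample documents about the same subject.
docs = [
    "<book><title/><author/><isbn/></book>",
    "<book><title/><author/></book>",
    "<book><title/><publisher/></book>",
]

# Trivial case: only the elements common to all documents.
common = frequent_paths(docs, 1.0)

# Relaxed case: elements present in at least 60% of the documents.
majority = frequent_paths(docs, 0.6)
```

With a threshold of 1.0 the result is the plain intersection, which shrinks quickly (and can become empty) as the documents grow more heterogeneous. Lowering the threshold admits elements such as `/book/author` that most, but not all, documents contain.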