XML version detection

  • Authors: Deise de Brum Saccol; Nina Edelweiss; Renata de Matos Galante; Carlo Zaniolo
  • Affiliations: Universidade Federal do Rio Grande do Sul; Universidade Federal do Rio Grande do Sul; Universidade Federal do Rio Grande do Sul; University of California
  • Venue: Proceedings of the 2007 ACM Symposium on Document Engineering
  • Year: 2007

Abstract

Version detection is critical in many application scenarios, including software clone identification, Web page ranking, plagiarism detection, and peer-to-peer searching. A natural and commonly used approach to version detection analyzes the similarity between files. Most techniques proposed so far apply hard thresholds to similarity measures. However, defining a threshold value is problematic for several reasons; in particular, (i) an appropriate threshold differs from one similarity function to another, and (ii) the value is not semantically meaningful to the user. To overcome this problem, our work proposes a version detection mechanism for XML documents based on Naïve Bayesian classifiers, which turns the detection problem into a classification problem. In this paper, we present the results of various experiments on synthetic data showing that our approach performs very well in terms of both recall and precision.
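
To illustrate the idea of replacing a hard similarity threshold with a classifier, the following is a minimal sketch, not the authors' implementation. It assumes each pair of XML documents has already been reduced to a vector of similarity scores in [0, 1]; the two features used here (a content similarity and a structure similarity), the class labels, the function names, and the toy training values are all hypothetical and are not taken from the paper. The sketch uses a Gaussian Naïve Bayes model, which is one common variant; the abstract does not say which variant the authors use.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(samples, labels, var_floor=1e-3):
    """Estimate a class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    model = {}
    for y, rows in by_class.items():
        cols = list(zip(*rows))  # one tuple of values per feature
        means = [sum(c) / len(c) for c in cols]
        # Floor the variance so a near-constant feature cannot cause division by ~0.
        variances = [
            max(sum((v - m) ** 2 for v in c) / len(c), var_floor)
            for c, m in zip(cols, means)
        ]
        model[y] = (len(rows) / len(samples), means, variances)
    return model


def posterior(model, x):
    """Return P(class | x), treating features as independent given the class."""
    log_scores = {}
    for y, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        log_scores[y] = score

    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    exp_scores = {y: math.exp(s - top) for y, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {y: e / total for y, e in exp_scores.items()}


def is_version(model, x):
    """Classify a document pair by comparing posteriors, with no fixed threshold."""
    probs = posterior(model, x)
    return probs["version"] > probs["not_version"]


# Hypothetical training pairs: (content_similarity, structure_similarity).
# These are made-up toy values, not data or results from the paper.
train = [
    (0.92, 0.88), (0.85, 0.95), (0.78, 0.81), (0.90, 0.76),
    (0.20, 0.35), (0.31, 0.15), (0.10, 0.28), (0.42, 0.40),
]
train_labels = ["version"] * 4 + ["not_version"] * 4

model = fit_gaussian_nb(train, train_labels)
```

Under these assumptions, `is_version(model, (0.88, 0.90))` asks whether a pair with high scores on both features is more probably a version pair than not. The decision comes from comparing class posteriors learned from labeled examples, so the user supplies examples rather than choosing a separate cut-off for each similarity function.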