RWS-Diff: flexible and efficient change detection in hierarchical data

  • Authors:
  • Jan P. Finis;Martin Raiber;Nikolaus Augsten;Robert Brunel;Alfons Kemper;Franz Färber

  • Affiliations:
  • Technische Universität München, Munich, Germany;Technische Universität München, Munich, Germany;Universität Salzburg, Salzburg, Austria;Technische Universität München, Munich, Germany;Technische Universität München, Munich, Germany;SAP AG, Walldorf, Germany

  • Venue:
  • Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

The problem of generating a cost-minimal edit script between two trees has many important applications. However, finding such a cost-minimal script is computationally hard, thus the only methods that scale are approximate ones. Various approximate solutions have been proposed recently. However, most of them still show quadratic or worse runtime complexity in the tree size and thus do not scale well either. The only solutions with log-linear runtime complexity use simple matching algorithms that only find corresponding subtrees as long as these subtrees are equal. Consequently, such solutions are not robust at all, since small changes in the leaves which occur frequently can make all subtrees that contain the changed leaves unequal and thus prevent the matching of large portions of the trees. This problem could be avoided by searching for similar instead of equal subtrees but current similarity approaches are too costly and thus also show quadratic complexity. Hence, currently no robust log-linear method exists. We propose the random walks similarity (RWS) measure which can be used to find similar subtrees rapidly. We use this measure to build the RWS-Diff algorithm that is able to compute an approximately cost-minimal edit script in log-linear time while having the robustness of a similarity-based approach. Our evaluation reveals that random walk similarity indeed increases edit script quality and robustness drastically while still maintaining a runtime comparable to simple matching approaches.