On repairing structural problems in semi-structured data

  • Authors:
  • Flip Korn;Barna Saha;Divesh Srivastava;Shanshan Ying

  • Affiliations:
  • AT&T Labs-Research;AT&T Labs-Research;AT&T Labs-Research;National University of Singapore

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2013

Abstract

Semi-structured data such as XML are popular for data interchange and storage. However, many XML documents have improper nesting where open- and close-tags are unmatched. Since some semi-structured data (e.g., LaTeX) have a flexible grammar and since many XML documents lack an accompanying DTD or XSD, we focus on computing a syntactic repair via the edit distance. To solve this problem, we propose a dynamic programming algorithm which takes cubic time. While this algorithm is not scalable, well-formed substrings of the data can be pruned to enable faster computation. Unfortunately, there are still cases where the dynamic program could be very expensive; hence, we give branch-and-bound algorithms based on various combinations of two heuristics, called MinCost and MaxBenefit, that trade off between accuracy and efficiency. Finally, we experimentally demonstrate the performance of these algorithms on real data.
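
To make the two ideas in the abstract concrete (a cubic-time dynamic program for the repair cost, and pruning of already well-formed substrings), here is a minimal sketch. It is not the authors' algorithm: it is a textbook interval dynamic program under an assumed cost model in which deleting a tag costs 1 and renaming a tag so that an open/close pair agrees costs 1. Under unit costs, inserting a partner for an orphan tag costs the same as deleting the orphan, so insertions are not modeled separately. The names parse, prune_well_formed and repair_cost are hypothetical, and the tokenizer only recognizes bare tags such as <a> and </a>.

```python
import re

def parse(text):
    """Turn '<a></a>' style text into (kind, name) tokens (assumed toy syntax)."""
    return [("close" if slash else "open", name)
            for slash, name in re.findall(r"<(/?)(\w+)>", text)]

def prune_well_formed(tokens):
    """Stack scan that drops every tag pair that already nests correctly.

    What remains is the part of the input that still needs repair.
    """
    stack = []
    for kind, name in tokens:
        if kind == "close" and stack and stack[-1] == ("open", name):
            stack.pop()                 # correctly matched pair: discard both
        else:
            stack.append((kind, name))  # keep for the repair step
    return stack

def repair_cost(tokens):
    """Minimum number of deletions/renames that make the sequence well-formed.

    dp[i][j] is the cost for tokens[i:j]. There are O(n^2) intervals and
    O(n) candidate partners per interval, hence O(n^3) time overall.
    """
    n = len(tokens)
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            best = 1 + dp[i + 1][j]     # option 1: delete tokens[i]
            kind_i, name_i = tokens[i]
            if kind_i == "open":
                # option 2: pair tokens[i] with a later close tag tokens[k]
                for k in range(i + 1, j):
                    kind_k, name_k = tokens[k]
                    if kind_k == "close":
                        rename = 0 if name_i == name_k else 1
                        cost = rename + dp[i + 1][k] + dp[k + 1][j]
                        best = min(best, cost)
            dp[i][j] = best
    return dp[0][n]

if __name__ == "__main__":
    for doc in ["<a><b></b></a>", "<a><b></a>", "<a><b></c></a>"]:
        tokens = parse(doc)
        print(doc, repair_cost(tokens), len(prune_well_formed(tokens)))
```

In this sketch the pruning pass runs in linear time, so applying it first shrinks the input that the cubic step has to handle; for example, "<a><b></b></a><c>" is reduced to the single orphan tag <c> before any dynamic programming is done.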