Efficient discovery of XML data redundancies

  • Authors:
  • Cong Yu;H. V. Jagadish

  • Affiliations:
  • Department of EECS, University of Michigan;Department of EECS, University of Michigan

  • Venue:
  • VLDB '06 Proceedings of the 32nd international conference on Very large data bases
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

As XML becomes widely used, dealing with redundancies in XML data has become an increasingly important issue. Redundantly stored information can lead not just to a higher data storage cost, but also to increased costs for data transfer and data manipulation. Furthermore, such data redundancies can lead to potential update anomalies, rendering the database inconsistent.One way to avoid data redundancies is to employ good schema design based on known functional dependencies. In fact, several recent studies have focused on defining the notion of XML Functional Dependencies (XML FDs) to capture XML data redundancies. We observe further that XML databases are often "casually designed" and XML FDs may not be determined in advance. Under such circumstances, discovering XML data redundancies (in terms of FDs) from the data itself becomes necessary and is an integral part of the schema refinement process.In this paper, we present the design and implementation of the first system, DiscoverXFD, for effcient discovery of XML data redundancies. It employs a novel XML data structure and introduces a new class of partition based algorithms. DiscoverXFD can not only be used for the previous definitions of XML functional dependencies, but also for a more comprehensive notion we develop in this paper, capable of detecting redundancies involving set elements while maintaining clear semantics. Experimental evaluations using real life and benchmark datasets demonstrate that our system is practical and scales well with increasing data size.