Nested mappings: schema mapping reloaded

  • Authors:
  • Ariel Fuxman;Mauricio A. Hernandez;Howard Ho;Renee J. Miller;Paolo Papotti;Lucian Popa

  • Affiliations:
  • University of Toronto and IBM Almaden Research Center;IBM Almaden Research Center;IBM Almaden Research Center;University of Toronto;Università Roma Tre and IBM Almaden Research Center;IBM Almaden Research Center

  • Venue:
  • VLDB '06 Proceedings of the 32nd international conference on Very large data bases
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

Many problems in information integration rely on specifications, called schema mappings, that model the relationships between schemas. Schema mappings for both relational and nested data are well-known. In this work, we present a new formalism for schema mapping that extends these existing formalisms in two significant ways. First, our nested mappings allow for nesting and correlation of mappings. This results in a natural programming paradigm that often yields more accurate specifications. In particular, we show that nested mappings can naturally preserve correlations among data that existing mapping formalisms cannot. We also show that using nested mappings for purposes of exchanging data from a source to a target will result in less redundancy in the target data. The second extension to the mapping formalism is the ability to express, in a declarative way, grouping and data merging semantics. This semantics can be easily changed and customized to the integration task at hand. We present a new algorithm for the automatic generation of nested mappings from schema matchings (that is, simple element-to-element correspondences between schemas). We have implemented this algorithm, along with algorithms for the generation of transformation queries (e.g., XQuery) based on the nested mapping specification. We show that the generation algorithms scale well to large, highly nested schemas. We also show that using nested mappings in data exchange can drastically reduce the execution cost of producing a target instance, particularly over large data sources, and can also dramatically improve the quality of the generated data.