XML data warehousing

  • Authors:
  • Hosagrahar V. Jagadish; Nuwee Wiwatwattana

  • Affiliations:
  • University of Michigan; University of Michigan

  • Venue:
  • XML data warehousing
  • Year:
  • 2007

Abstract

Data warehousing is an important application of database technology. Even though XML is ubiquitous and there are many XML databases, there are almost no XML data warehouses today. This thesis overcomes two of the many barriers to building XML data warehouses by efficiently representing and manipulating multiple hierarchies within an XML database used as a warehouse.

The XML format is flexible and permits the graceful representation of heterogeneous data. However, it is limited in that it assumes there is a single perfect hierarchy in which the data can be organized. When the information to be represented naturally has multiple dimensions, as in data warehouses, fundamental tensions appear in the modeling and schema design. Data represented as deep trees is often un-normalized, leading to update anomalies, while normalized data tends to be shallow, resulting in heavy use of expensive value-based joins. As a solution, we propose an evolutionary and novel extension of the standard one-dimensional XML data model into a multi-dimensional model, called the Multi-Colored Trees (MCT) logical data model. MCT permits trees with multi-colored nodes to signify participation in multiple dimensions. We have developed algorithms to transform design specifications given as ER diagrams into MCT schemas. These MCT schemas satisfy various desirable properties, such as node normal form, edge normal form, and association recoverability. Experimental studies with warehousing data show that the schemas we designed in MCT have many benefits over conventional XML schemas, including query efficiency, query expression ease, and update anomaly avoidance.

Even after modeling issues are resolved, we still have to consider issues of efficient implementation. We extend bitmap join indices to the XML context, and demonstrate experimentally their benefit for typical queries, including those with low cardinality or high selectivity. We also consider the data cube, a core warehouse analysis operator involving aggregations along multiple dimensions, and show that it cannot readily be expressed or evaluated for XML data. Specifically, XML data is not always summarizable because of missing and repeated sub-elements. We define an XML version of the OLAP cube operator, and appropriately extend relational cube computation algorithms.
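
The summarizability problem mentioned above can be illustrated with a minimal sketch. It is not the thesis's cube operator or algorithm; the sample document, the element names (`order`, `region`, `amount`), and the helper `totals_by` are all hypothetical and chosen only for illustration. The sketch groups a measure by a sub-element and shows that the per-group totals no longer add up to the grand total when that sub-element is missing in one fact and repeated in another.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data: order 2 has no <region>, order 3 has two.
DOC = """
<orders>
  <order id="1"><region>east</region><amount>10</amount></order>
  <order id="2"><amount>20</amount></order>
  <order id="3"><region>east</region><region>west</region><amount>30</amount></order>
</orders>
"""

def totals_by(root, fact_tag, dim_tag, measure_tag):
    """Sum the measure of each fact under every value of its dimension sub-element.

    A fact with no dimension sub-element contributes to no group;
    a fact with repeated sub-elements contributes to several groups.
    """
    totals = Counter()
    for fact in root.iter(fact_tag):
        amount = int(fact.findtext(measure_tag))
        for dim in fact.findall(dim_tag):
            totals[dim.text] += amount
    return dict(totals)

root = ET.fromstring(DOC)
by_region = totals_by(root, "order", "region", "amount")
grand_total = sum(int(o.findtext("amount")) for o in root.iter("order"))

print(by_region)                 # per-region totals
print(sum(by_region.values()))   # roll-up of the per-region totals
print(grand_total)               # total computed directly from the facts
```

In this sample the roll-up of the regional totals differs from the directly computed total: order 2 is dropped from every group and order 3 is counted once per region. A relational cube typically assumes each fact maps to exactly one value per dimension, which is the assumption that fails here.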