Document-centric OLAP in the schema-chaos world

  • Authors:
  • Yannis Sismanis;Berthold Reinwald;Hamid Pirahesh

  • Affiliations:
  • IBM Almaden Research Center;IBM Almaden Research Center;IBM Almaden Research Center

  • Venue:
  • BIRTE'06 Proceedings of the 1st international conference on Business intelligence for the real-time enterprises
  • Year:
  • 2006

Abstract

Gaining business insights, such as measuring the effectiveness of a product campaign, requires integrating a multitude of different data sources. Such data sources include in-house applications (like CRM, ERP), partner databases (like loyalty card data from retailers), and syndicated data sources (like credit reports from Experian). However, different data sources represent the same semantic attributes in different ways. For example, two XML schemas for purchase orders may represent price as /SAP46Order/Product/Price or /PeopleSoft/Item/Sold/Cost, respectively. Because the path to the same semantic information depends on the schema, it is difficult to index the data and for query languages such as XQuery to process aggregation queries. Shredding the XML documents is not feasible due to the vast number of different schemas and the complexity of the XML documents. The only known approach today is to ETL every single document into a common schema, and then use XQuery on the transformed data to perform aggregation. Such a solution does not scale well with the number of schemas or with their natural evolution. This paper presents a robust solution to document-centric OLAP over highly heterogeneous data. The solution exploits text indexing, which provides the necessary flexibility, together with well-established techniques for aggregation (like star-joins and bitmap processing). We present the overall architecture and the experimental performance results from our implementation.
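
To make the path-heterogeneity problem concrete, the following is a minimal sketch, not the architecture described in the paper. It indexes leaf values by a keyword derived from the element name instead of by the full schema-specific path, so that one aggregation covers both purchase-order layouts mentioned above. The sample documents, the SYNONYMS table, and the function names are hypothetical and introduced only for illustration; the star-join and bitmap processing mentioned in the abstract are not reproduced here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample documents using the two path shapes from the abstract.
DOCS = [
    "<SAP46Order><Product><Price>10.0</Price></Product></SAP46Order>",
    "<PeopleSoft><Item><Sold><Cost>5.5</Cost></Sold></Item></PeopleSoft>",
]

# Assumed keyword mapping: which element names denote the same attribute.
SYNONYMS = {"price": "price", "cost": "price"}


def index_documents(docs):
    """Build an inverted index: keyword -> list of (doc_id, leaf text)."""
    index = defaultdict(list)
    for doc_id, xml_text in enumerate(docs):
        root = ET.fromstring(xml_text)
        for elem in root.iter():
            text = (elem.text or "").strip()
            # Keep only leaf elements that carry a text value.
            if not text or len(elem) > 0:
                continue
            tag = elem.tag.lower()
            index[SYNONYMS.get(tag, tag)].append((doc_id, text))
    return index


def total(index, keyword):
    """Sum all numeric values indexed under a keyword, regardless of path."""
    return sum(float(value) for _, value in index.get(keyword, []))


if __name__ == "__main__":
    idx = index_documents(DOCS)
    print(total(idx, "price"))
```

The point of the sketch is only that the lookup key is a term rather than a path, so adding a new schema requires at most a new synonym entry and no per-schema transformation step.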