Document-centric OLAP in the schema-chaos world

Authors:
Yannis Sismanis;Berthold Reinwald;Hamid Pirahesh
Affiliations:
IBM Almaden Research Center;IBM Almaden Research Center;IBM Almaden Research Center
Venue:
BIRTE'06 Proceedings of the 1st international conference on Business intelligence for the real-time enterprises
Year:
2006


Abstract

Gaining business insights, such as measuring the effectiveness of a product campaign, requires integrating a multitude of different data sources. Such data sources include in-house applications (like CRM, ERP), partner databases (like loyalty card data from retailers), and syndicated data sources (like credit reports from Experian). However, different data sources represent the same semantic attributes in different ways. For example, two XML schemas for purchase orders may represent price as /SAP46Order/Product/Price or /PeopleSoft/Item/Sold/Cost, respectively. Because the path to the same semantic information depends on the schema, it is difficult to index the data and for query languages such as XQuery to process aggregation queries. Shredding the XML documents is not feasible due to the vast number of different schemas and the complexity of the XML documents. The only known approach today is to ETL every single document into a common schema, and then use XQuery on the transformed data to perform aggregation. Such a solution does not scale well with the number of schemas or their natural evolution. This paper presents a robust solution to document-centric OLAP over highly heterogeneous data. The solution exploits text indexing, which provides the necessary flexibility, together with well-established techniques for aggregation (like star-joins and bitmap processing). We present the overall architecture and the experimental performance results from our implementation.
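
The path-heterogeneity problem described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it assumes a hand-written synonym table (`SYNONYMS`) that maps leaf tag names to one semantic attribute, and the sample documents and their values are made up for illustration. Only the two price paths come from the abstract. The sketch indexes leaf paths by their last tag, much like a keyword index would, and then aggregates over the index without transforming any document into a common schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample documents; only the two price paths come from the abstract.
DOCS = [
    "<SAP46Order><Product><Price>10.0</Price></Product></SAP46Order>",
    "<PeopleSoft><Item><Sold><Cost>5.5</Cost></Sold></Item></PeopleSoft>",
    "<SAP46Order><Product><Price>4.5</Price></Product></SAP46Order>",
]

# Assumed synonym table: tag names that denote the same semantic attribute.
SYNONYMS = {"Price": "price", "Cost": "price"}


def leaf_paths(elem, prefix=""):
    """Yield (path, text) for every leaf element at or below elem."""
    path = f"{prefix}/{elem.tag}"
    children = list(elem)
    if not children:
        yield path, (elem.text or "").strip()
    for child in children:
        yield from leaf_paths(child, path)


def build_index(docs):
    """Inverted index: semantic attribute -> list of (doc_id, path, value)."""
    index = defaultdict(list)
    for doc_id, xml_text in enumerate(docs):
        root = ET.fromstring(xml_text)
        for path, text in leaf_paths(root):
            tag = path.rsplit("/", 1)[-1]
            attr = SYNONYMS.get(tag)
            if attr is not None:
                index[attr].append((doc_id, path, float(text)))
    return index


def total_by_source(index, attr):
    """Sum an attribute's values, grouped by each document's root element."""
    totals = defaultdict(float)
    for _doc_id, path, value in index[attr]:
        source = path.split("/")[1]  # first path step is the root tag
        totals[source] += value
    return dict(totals)


index = build_index(DOCS)
totals = total_by_source(index, "price")
print(totals)
```

Adding a new schema in this sketch means adding an entry to the synonym table rather than writing a new transformation, which is the kind of flexibility the abstract attributes to text indexing.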