Cardinality estimation in ETL processes

Authors:
Maik Thiele;Tim Kiefer;Wolfgang Lehner
Affiliations:
Technische Universität Dresden, Dresden, Germany;Technische Universität Dresden, Dresden, Germany;Technische Universität Dresden, Dresden, Germany
Venue:
Proceedings of the ACM twelfth international workshop on Data warehousing and OLAP
Year:
2009



Abstract

Cardinality estimation in ETL processes is particularly difficult. Besides the well-known SQL operators, which are also used in ETL processes, there is a variety of operators without exact counterparts in the relational world, as well as operators that support very specific data integration aspects. For such operators, there are no well-examined statistical approaches to cardinality estimation. We therefore propose a black-box approach that estimates the cardinality using a set of statistical models for each operator. We discuss different model granularities and develop an adaptive cardinality estimation framework for ETL processes. We map the abstract model operators to specific statistical learning approaches (regression, decision trees, support vector machines, etc.) and evaluate our cardinality estimates in an extensive experimental study.
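
To make the black-box idea concrete, the following is a minimal sketch and not the framework described in the paper. It assumes the simplest possible model per operator: a one-variable least-squares regression from input row count to output row count, refitted after every observed execution. The class name `OperatorCardinalityModel`, the helper `estimate_pipeline`, and the choice of input cardinality as the only feature are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OperatorCardinalityModel:
    """Hypothetical black-box model for one ETL operator.

    Assumes output_rows ~ slope * input_rows + intercept.
    """
    slope: float = 1.0        # prior: the operator passes every row through
    intercept: float = 0.0
    observations: list = field(default_factory=list)

    def observe(self, input_rows: int, output_rows: int) -> None:
        """Record one execution of the operator and refit the model."""
        self.observations.append((input_rows, output_rows))
        self._refit()

    def _refit(self) -> None:
        n = len(self.observations)
        mean_x = sum(x for x, _ in self.observations) / n
        mean_y = sum(y for _, y in self.observations) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in self.observations)
        if var_x == 0:
            # Only one distinct input size seen so far:
            # fall back to a plain selectivity ratio through the origin.
            self.slope = mean_y / mean_x if mean_x else 0.0
            self.intercept = 0.0
            return
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in self.observations)
        self.slope = cov_xy / var_x
        self.intercept = mean_y - self.slope * mean_x

    def estimate(self, input_rows: float) -> float:
        """Estimate the output cardinality; never returns a negative value."""
        return max(0.0, self.slope * input_rows + self.intercept)


def estimate_pipeline(models, input_rows: float) -> float:
    """Chain per-operator estimates through a linear sequence of operators."""
    rows = input_rows
    for model in models:
        rows = model.estimate(rows)
    return rows


if __name__ == "__main__":
    # Hypothetical duplicate-elimination operator that keeps about 80% of rows.
    dedup = OperatorCardinalityModel()
    for rows_in, rows_out in [(1000, 800), (2000, 1600), (4000, 3200)]:
        dedup.observe(rows_in, rows_out)
    print(dedup.estimate(3000))
```

Refitting on every observation is what makes the sketch adaptive in a loose sense. The regression could be swapped for a decision tree or a support vector machine behind the same `observe` and `estimate` interface, which is the kind of mapping from abstract model to learning approach that the abstract mentions.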