Efficient resumption of interrupted warehouse loads

Authors:
Wilburt Juan Labio; Janet L. Wiener; Hector Garcia-Molina; Vlad Gorelik
Affiliations:
Gigabeat, Inc., Palo Alto, CA; Compaq SRC, Palo Alto, CA; Stanford University; Sagent Technologies
Venue:
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Year:
2000

Abstract

Data warehouses collect large quantities of data from distributed sources into a single repository. A typical load to create or maintain a warehouse processes gigabytes of data, takes hours or even days to execute, and involves many complex, user-defined transformations of the data (e.g., finding duplicates, resolving data inconsistencies, and adding unique keys). If the load fails, one possible approach is to “redo” the entire load. A better approach is to resume the incomplete load from where it was interrupted. Unfortunately, traditional algorithms for resuming the load either impose unacceptable overhead during normal operation or rely on the specifics of the transformations. We develop a resumption algorithm called DR that imposes no overhead and relies only on the high-level properties of the transformations. Experiments using commercial software show that DR can lead to a ten-fold reduction in resumption time.
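
To make the contrast between redoing and resuming a load concrete, here is a minimal Python sketch. It is not the DR algorithm, whose details the abstract does not give. It assumes a hypothetical transformation `clean` that is deterministic, maps each input row to exactly one output row, and preserves the row key, so that the rows already in the warehouse are enough to tell which inputs can be skipped. The names `load`, `resume`, `clean`, and the `(key, payload)` row shape are illustrative assumptions.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Row = Tuple[int, str]  # (key, payload); hypothetical row shape


def clean(row: Row) -> Row:
    """Hypothetical transformation: deterministic, one-to-one, key-preserving."""
    key, payload = row
    return key, payload.strip().lower()


def load(
    source: Iterable[Row],
    transform: Callable[[Row], Row],
    warehouse: Dict[int, str],
    fail_after: Optional[int] = None,
) -> int:
    """Redo-style load: transform every input row.

    `fail_after` simulates a failure once that many rows have been loaded.
    Returns the number of rows transformed.
    """
    processed = 0
    for row in source:
        if fail_after is not None and processed == fail_after:
            raise RuntimeError("simulated load failure")
        key, value = transform(row)
        warehouse[key] = value
        processed += 1
    return processed


def resume(
    source: Iterable[Row],
    transform: Callable[[Row], Row],
    warehouse: Dict[int, str],
) -> int:
    """Resume-style load: skip inputs whose key is already in the warehouse.

    This is only safe under the assumptions stated above (deterministic,
    one-to-one, key-preserving transformation). Returns rows transformed.
    """
    processed = 0
    for row in source:
        if row[0] in warehouse:
            continue  # output already loaded before the failure
        key, value = transform(row)
        warehouse[key] = value
        processed += 1
    return processed


if __name__ == "__main__":
    source = [(1, " A "), (2, "B"), (3, " c"), (4, "D ")]
    warehouse: Dict[int, str] = {}
    try:
        load(source, clean, warehouse, fail_after=3)
    except RuntimeError:
        pass
    print("rows transformed on resume:", resume(source, clean, warehouse))
    print("warehouse keys:", sorted(warehouse))
```

In this sketch the resumed load transforms only the rows that were not loaded before the failure, whereas calling `load` again would transform all of them. No extra state is written during the normal load; the skip decision uses only what is already in the warehouse and a stated property of the transformation.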