The WHIPS prototype for data warehouse creation and maintenance

  • Authors:
  • Wilburt J. Labio;Yue Zhuge;Janet L. Wiener;Himanshu Gupta;Héctor García-Molina;Jennifer Widom

  • Affiliations:
  • Department of Computer Science, Stanford University, Stanford, CA;Department of Computer Science, Stanford University, Stanford, CA;Department of Computer Science, Stanford University, Stanford, CA;Department of Computer Science, Stanford University, Stanford, CA;Department of Computer Science, Stanford University, Stanford, CA;Department of Computer Science, Stanford University, Stanford, CA

  • Venue:
  • SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
  • Year:
  • 1997

Quantified Score

Hi-index 0.02

Visualization

Abstract

A data warehouse is a repository of integrated information from distributed, autonomous, and possibly heterogeneous, sources. In effect, the warehouse stores one or more materialized views of the source data. The data is then readily available to user applications for querying and analysis. Figure 1 shows the basic architecture of a warehouse: data is collected from each source, integrated with data from other sources, and stored at the warehouse. Users then access the data directly from the warehouse.As suggested by Figure 1, there are two major components in a warehouse system: the integration component, responsible for collecting and maintaining the materialized views, and the query and analysis component, responsible for fulfilling the information needs of specific end users. Note that the two components are not independent. For example, which views the integration component materializes depends on the expected needs of end users.Most current commercial warehousing systems (e.g., Redbrick, Sybase, Arbor) focus on the query and analysis component, providing specialized index structures at the warehouse and extensive querying facilities for the end user. In the WHIPS (WareHousing Information Project at Stanford) project, on the other hand, we focus on the integration component. In particular, we have developed an architecture and implemented a prototype for identifying data changes at heterogeneous sources, transforming them and summarizing them in accordance to warehouse specifications, and incrementally integrating them into the warehouse. We propose to demonstrate our prototype at SIGMOD, illustrating the main features of our architecture. Our architecture is modular and we designed it specifically to fulfill several important and interrelated goals: data sources and warehouse views can be added and removed dynamically; it is scalable by adding more internal modules; changes at the sources are detected automatically; the warehouse may be updated continuously as the sources change, without requiring “down time;” and the warehouse is always kept consistent with the source data by the integration algorithms. More details on these goals and how we achieve them are provided in [WGL+96].