A cache filtering optimisation for queries to massive datasets on tertiary storage

Authors:
Koen Holtman; Peter van der Stok; Ian Willers
Affiliations:
CERN - EP division, CH-1211 Geneva 23, Switzerland; Eindhoven University of Technology, Postbus 513, 5600 MB Eindhoven, The Netherlands; CERN - EP division, CH-1211 Geneva 23, Switzerland
Venue:
Proceedings of the 2nd ACM international workshop on Data warehousing and OLAP
Year:
1999

Abstract

We consider a system in which many users run queries to examine subsets of a large object set. The object set is partitioned into files on tape. A single subset of objects will be visited by multiple queries in the workload. This locality of access creates the opportunity for caching on disk. We introduce and evaluate a novel optimisation, cache filtering, in which the 'hot' objects are automatically extracted from the files that are staged on disk, and then cached separately in new files on disk. Cache filtering can lead to complex situations in the disk cache. We show that these do not prevent effective caching and we introduce a special cache replacement algorithm to maximise efficiency. Through simulations we evaluate the system over a broad range of likely workloads. Depending on workload and system parameters, the cache filtering optimisation yields speedup factors up to 6.
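
The core idea of cache filtering described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `filter_staged_file`, the access-count heuristic, and the `hot_threshold` parameter are all assumptions made for illustration, and the paper's special cache replacement algorithm is not modelled here. The sketch only shows how objects visited by several queries might be extracted from a staged file so that they can be cached separately.

```python
from collections import Counter

def filter_staged_file(staged_objects, access_counts, hot_threshold=2):
    """Split a staged file into 'hot' objects and the remainder.

    Assumption: an object counts as hot when at least `hot_threshold`
    queries in the workload visit it. The hot objects would be written
    to a new, smaller cache file on disk; the rest stay in the staged
    file, which can then be evicted.
    """
    hot = [obj for obj in staged_objects if access_counts[obj] >= hot_threshold]
    rest = [obj for obj in staged_objects if access_counts[obj] < hot_threshold]
    return hot, rest

# Hypothetical workload: each query is the set of object ids it visits.
queries = [{1, 2, 3}, {2, 3, 4}, {2, 3, 9}]
access_counts = Counter(obj for query in queries for obj in query)

# Hypothetical tape file staged on disk, holding objects 1 to 6.
staged_file = [1, 2, 3, 4, 5, 6]
hot, rest = filter_staged_file(staged_file, access_counts)
print(hot, rest)
```

In this toy workload only objects 2 and 3 are visited by more than one query, so they alone would be copied into a new cache file, which takes far less disk space than keeping the whole staged file.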