A cache filtering optimisation for queries to massive datasets on tertiary storage

  • Authors: Koen Holtman; Peter van der Stok; Ian Willers

  • Affiliations: CERN, EP division, CH-1211 Geneva 23, Switzerland; Eindhoven University of Technology, Postbus 513, 5600 MB Eindhoven, The Netherlands; CERN, EP division, CH-1211 Geneva 23, Switzerland

  • Venue: Proceedings of the 2nd ACM international workshop on Data warehousing and OLAP
  • Year: 1999

Abstract

We consider a system in which many users run queries to examine subsets of a large object set. The object set is partitioned into files on tape. A single subset of objects will be visited by multiple queries in the workload. This locality of access creates the opportunity for caching on disk. We introduce and evaluate a novel optimisation, cache filtering, in which the 'hot' objects are automatically extracted from the files that are staged on disk, and then cached separately in new files on disk. Cache filtering can lead to complex situations in the disk cache. We show that these do not prevent effective caching and we introduce a special cache replacement algorithm to maximise efficiency. Through simulations we evaluate the system over a broad range of likely workloads. Depending on workload and system parameters, the cache filtering optimisation yields speedup factors up to 6.
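
The basic idea of cache filtering can be illustrated with a toy model. The sketch below is not the authors' implementation: the function name `run_query`, the representation of files as sets of object ids, and the absence of any cache size limit or replacement policy are all assumptions made for illustration. It only shows how copying the 'hot' objects of a staged file into a new, smaller file lets a later query over the same subset avoid staging from tape.

```python
from typing import Dict, List, Set


def run_query(
    query: Set[int],
    tape_files: Dict[str, Set[int]],
    filtered_cache: List[Set[int]],
) -> int:
    """Serve a query and return how many objects had to be staged from tape.

    Hypothetical toy model: `tape_files` maps a file name to the object ids
    it holds, and `filtered_cache` is the list of filtered files on disk.
    """
    remaining = set(query)

    # Objects already held in filtered files on disk need no tape access.
    for cached_objects in filtered_cache:
        remaining -= cached_objects

    staged = 0
    for objects in tape_files.values():
        hot = objects & remaining
        if not hot:
            continue  # this file holds nothing the query still needs
        # The whole file must be staged, even if only a few objects are hot.
        staged += len(objects)
        # Cache filtering: keep only the hot objects, in a new file on disk.
        filtered_cache.append(hot)
        remaining -= hot
    return staged


tape = {"f1": {1, 2, 3, 4}, "f2": {5, 6, 7, 8}}
cache: List[Set[int]] = []
first = run_query({1, 5}, tape, cache)   # both files are staged from tape
second = run_query({1, 5}, tape, cache)  # served from the filtered files
```

In this model the second query stages nothing, and the disk holds two objects rather than the eight in the staged files. A real system would also have to decide when filtering pays off and which files to evict, which is what the special cache replacement algorithm mentioned in the abstract addresses.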