Online maintenance of very large random samples on flash storage

  • Authors:
  • Suman Nath;Phillip B. Gibbons

  • Affiliations:
  • Microsoft Research, Redmond, USA;Intel Labs Pittsburgh, Pittsburgh, USA

  • Venue:
  • The VLDB Journal — The International Journal on Very Large Data Bases
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Recent advances in flash storage have made it an attractive alternative for data storage in a wide spectrum of computing devices, such as embedded sensors, mobile phones, PDA's, laptops, and even servers. However, flash storage has many unique characteristics that make existing data management/analytics algorithms designed for magnetic disks perform poorly with flash storage. For example, while random reads can be nearly as fast as sequential reads, random writes and in-place data updates are orders of magnitude slower than sequential writes. In this paper, we consider an important fundamental problem that would seem to be particularly challenging for flash storage: efficiently maintaining a very large random sample of a data stream (e.g., of sensor readings). First, we show that previous algorithms such as reservoir sampling and geometric file are not readily adapted to flash. Second, we propose B-File, an energy-efficient abstraction for flash storage to store self-expiring items, and show how a B-File can be used to efficiently maintain a large sample in flash. Our solution is simple, has a small (RAM) memory footprint, and is designed to cope with flash constraints in order to reduce latency and energy consumption. Third, we provide techniques to maintain biased samples with a B-File and to query the large sample stored in a B-File for a subsample of an arbitrary size. Finally, we present an evaluation with flash storage that shows our techniques are several orders of magnitude faster and more energy-efficient than (flash-friendly versions of) reservoir sampling and geometric file. A key finding of our study, of potential use to many flash algorithms beyond sampling, is that "semi-random" writes (as defined in the paper) on flash cards are over two orders of magnitude faster and more energy-efficient than random writes.