Online maintenance of very large random samples on flash storage

Authors:
Suman Nath; Phillip B. Gibbons
Affiliations:
Microsoft Research, Redmond, USA; Intel Labs Pittsburgh, Pittsburgh, USA
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2010

Abstract

Recent advances in flash storage have made it an attractive alternative for data storage in a wide spectrum of computing devices, such as embedded sensors, mobile phones, PDAs, laptops, and even servers. However, flash storage has unique characteristics that cause existing data management and analytics algorithms, which were designed for magnetic disks, to perform poorly on it. For example, while random reads can be nearly as fast as sequential reads, random writes and in-place data updates are orders of magnitude slower than sequential writes. In this paper, we consider an important fundamental problem that would seem to be particularly challenging for flash storage: efficiently maintaining a very large random sample of a data stream (e.g., of sensor readings). First, we show that previous algorithms such as reservoir sampling and geometric file are not readily adapted to flash. Second, we propose B-File, an energy-efficient abstraction for flash storage that stores self-expiring items, and show how a B-File can be used to efficiently maintain a large sample in flash. Our solution is simple, has a small memory (RAM) footprint, and is designed to cope with flash constraints in order to reduce latency and energy consumption. Third, we provide techniques to maintain biased samples with a B-File and to query the large sample stored in a B-File for a subsample of arbitrary size. Finally, we present an evaluation with flash storage showing that our techniques are several orders of magnitude faster and more energy-efficient than (flash-friendly versions of) reservoir sampling and geometric file. A key finding of our study, of potential use to many flash algorithms beyond sampling, is that "semi-random" writes (as defined in the paper) on flash cards are over two orders of magnitude faster and more energy-efficient than random writes.
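
For readers unfamiliar with the baseline named in the abstract, the following is a minimal Python sketch of classic in-memory reservoir sampling (Algorithm R). It is not the paper's B-File, and the function name `reservoir_sample` and the `overwritten_slots` bookkeeping are illustrative assumptions added here, not taken from the paper. The sketch records which reservoir slots are overwritten so the access pattern is visible: once the reservoir is full, every accepted item replaces a uniformly chosen slot, which on flash would correspond to the random in-place writes the abstract describes as orders of magnitude slower than sequential writes.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Classic reservoir sampling (Algorithm R): uniform sample of size k.

    Returns the sample plus the list of slot indices that were overwritten
    in place (illustrative bookkeeping only, not part of the algorithm).
    """
    rng = rng or random.Random()
    reservoir = []
    overwritten_slots = []
    for i, item in enumerate(stream):
        if i < k:
            # Initial fill: items are appended sequentially.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # uniform over 0..i, both ends inclusive
            if j < k:
                # In-place overwrite at a uniformly random slot.
                reservoir[j] = item
                overwritten_slots.append(j)
    return reservoir, overwritten_slots


if __name__ == "__main__":
    sample, slots = reservoir_sample(range(100_000), k=1_000,
                                     rng=random.Random(42))
    print(f"sample size: {len(sample)}, in-place overwrites: {len(slots)}")
```

The overwritten slot indices follow no sequential order, which is why a direct port of this algorithm to flash pays the random-write penalty on every replacement.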