Online maintenance of very large random samples on flash storage

Authors:
Suman Nath; Phillip B. Gibbons
Affiliations:
Microsoft Research; Intel Research Pittsburgh
Venue:
Proceedings of the VLDB Endowment
Year:
2008

Abstract

Recent advances in flash media have made it an attractive alternative for data storage in a wide spectrum of computing devices, such as embedded sensors, mobile phones, PDAs, laptops, and even servers. However, flash media has many unique characteristics that make existing data management/analytics algorithms designed for magnetic disks perform poorly on flash storage. For example, while random (page) reads are as fast as sequential reads, random (page) writes and in-place data updates are orders of magnitude slower than sequential writes. In this paper, we consider an important fundamental problem that would seem to be particularly challenging for flash storage: efficiently maintaining a very large (100 MB or more) random sample of a data stream (e.g., of sensor readings). First, we show that previous algorithms such as reservoir sampling and the geometric file are not readily adapted to flash. Second, we propose B-FILE, an energy-efficient abstraction for flash media to store self-expiring items, and show how a B-FILE can be used to efficiently maintain a large sample in flash. Our solution is simple, has a small (RAM) memory footprint, and is designed to cope with flash constraints in order to reduce latency and energy consumption. Third, we provide techniques to maintain biased samples with a B-FILE and to query the large sample stored in a B-FILE for a subsample of an arbitrary size. Finally, we present an evaluation with flash media that shows our techniques are several orders of magnitude faster and more energy-efficient than (flash-friendly versions of) reservoir sampling and the geometric file. A key finding of our study, of potential use to many flash algorithms beyond sampling, is that "semi-random" writes (as defined in the paper) on flash cards are over two orders of magnitude faster and more energy-efficient than random writes.
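
To illustrate why classic reservoir sampling is a poor fit for flash, the following is a minimal in-memory sketch of the textbook algorithm (Algorithm R), not the paper's B-FILE. The function name, the `overwrites` counter, and the `rng` parameter are assumptions made for illustration. Once the reservoir is full, every accepted item replaces a uniformly chosen slot, which on a flash-resident sample would become a random in-place write, the access pattern the abstract identifies as orders of magnitude slower than sequential writes.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Textbook reservoir sampling (Algorithm R), kept in RAM for illustration.

    Returns (sample, overwrites), where `overwrites` counts in-place
    replacements at random slots. The counter is a hypothetical addition
    used only to make the write pattern visible.
    """
    rng = rng or random.Random()
    reservoir = []
    overwrites = 0
    for i, item in enumerate(stream):
        if i < k:
            # Fill phase: items are appended in arrival order (sequential).
            reservoir.append(item)
        else:
            # Item i is kept with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                # Replacement lands on a uniformly random slot: an in-place
                # update at a random position in the stored sample.
                reservoir[j] = item
                overwrites += 1
    return reservoir, overwrites


if __name__ == "__main__":
    sample, overwrites = reservoir_sample(range(1_000_000), 1000, random.Random(0))
    print(len(sample), "items sampled;", overwrites, "random in-place overwrites")
```

The number of overwrites grows with the stream length even though the sample size stays fixed, so the cost of random writes dominates when the sample lives on flash rather than in RAM.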