FEMTO: fast search of large sequence collections

Authors: Michael P. Ferguson
Affiliations: Laboratory for Telecommunications Sciences, College Park, Maryland
Venue: CPM'12 Proceedings of the 23rd Annual Conference on Combinatorial Pattern Matching
Year: 2012

Abstract

We present FEMTO, a new system for indexing and searching large collections of sequence data. We used FEMTO to index and search three large collections, including one 182 GB collection. We compare FEMTO's indexing and search performance with that of Bowtie and Lucene, and we compare performance when indexes are stored on hard disks and when they are stored in flash memory. To our knowledge, this is the first report of a compressed suffix array storing more than 100 GB. Even for the largest collection, most searches completed in under 10 seconds.