Inverted files versus suffix arrays for locating patterns in primary memory

Authors:
Simon J. Puglisi;W. F. Smyth;Andrew Turpin
Affiliations:
Curtin University of Technology, Perth, Australia;Curtin University of Technology, Perth, Australia;RMIT University, Melbourne, Australia
Venue:
SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval
Year:
2006

Citing 21
Cited 11

Suffix arrays: a new method for on-line string searches

SIAM Journal on Computing
Overview of the second text retrieval conference (TREC-2)

TREC-2 Proceedings of the second conference on Text retrieval conference
A Space-Economical Suffix Tree Construction Algorithm

Journal of the ACM (JACM)
Managing gigabytes (2nd ed.): compressing and indexing documents and images

Managing gigabytes (2nd ed.): compressing and indexing documents and images
Reducing the space requirement of suffix trees

Software—Practice & Experience
Fast and flexible word searching on compressed text

ACM Transactions on Information Systems (TOIS)
An analysis of the Burrows-Wheeler transform

Journal of the ACM (JACM)
Succinct representations of lcp information and improvements in the compressed suffix arrays

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Adding Compression to Block Addressing Inverted Indexes

Information Retrieval
Indexing and Retrieval for Genomic Databases

IEEE Transactions on Knowledge and Data Engineering
Searching Large Lexicons for Partially Specified Terms using Compressed Inverted Files

VLDB '93 Proceedings of the 19th International Conference on Very Large Data Bases
Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array

ISAAC '00 Proceedings of the 11th International Conference on Algorithms and Computation
Optimal Exact String Matching Based on Suffix Arrays

SPIRE 2002 Proceedings of the 9th International Symposium on String Processing and Information Retrieval
Opportunistic data structures with applications

FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
When indexing equals compression: experiments with compressing suffix arrays and applications

SODA '04 Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms
Inverted Index Compression Using Word-Aligned Binary Codes

Information Retrieval
Improved Gapped Alignment in BLAST

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Succinct suffix arrays based on run-length encoding

Nordic Journal of Computing
GLIMPSE: a tool to search through entire file systems

WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference
Linear pattern matching algorithms

SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973)
Advantages of backward searching — efficient secondary memory and distributed implementation of compressed suffix arrays

ISAAC'04 Proceedings of the 15th international conference on Algorithms and Computation

Output-sensitive autocompletion search

Information Retrieval
Algorithms and data structures for external memory

Foundations and Trends® in Theoretical Computer Science
Structural optimization of a full-text n-gram index using relational normalization

The VLDB Journal — The International Journal on Very Large Data Bases
Compressed text indexes: From theory to practice

Journal of Experimental Algorithmics (JEA)
Out of the Box Phrase Indexing

SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
Compression, indexing, and retrieval for massive string data

CPM'10 Proceedings of the 21st annual conference on Combinatorial pattern matching
Engineering basic algorithms of an in-memory text search engine

ACM Transactions on Information Systems (TOIS)
Top-k ranked document search in general text databases

ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part II
Word-based self-indexes for natural language text

ACM Transactions on Information Systems (TOIS)
String matching with alphabet sampling

Journal of Discrete Algorithms
Space-efficient algorithms for document retrieval

CPM'07 Proceedings of the 18th annual conference on Combinatorial Pattern Matching

Abstract

Recent advances in the asymptotic resource costs of pattern matching with compressed suffix arrays are attractive, but a key rival structure, the compressed inverted file, has been dismissed or ignored in papers presenting the new structures. In this paper we compare the resource requirements of compressed suffix array algorithms with those of compressed inverted file data structures for general pattern matching in genomic and English texts. In both cases, the inverted file indexes q-grams, thus allowing full pattern matching capabilities rather than simple word-based search, and making its functionality equivalent to that of the compressed suffix array structures. When using equivalent memory for the two structures, inverted files are faster at reporting the location of patterns when the number of occurrences of the patterns is high.
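
To illustrate how an inverted file over q-grams can support general pattern location rather than word search only, the following is a minimal, uncompressed sketch. It is not the authors' implementation: the function names `build_qgram_index` and `locate`, the choice of q, and the use of plain Python lists in place of compressed posting lists are all assumptions made for illustration.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every q-gram of `text` to the list of positions where it starts.

    Hypothetical helper: a real inverted file would compress these lists.
    """
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def locate(index, q, pattern):
    """Return the sorted start positions of `pattern`, using only the index.

    Assumes len(pattern) >= q. The pattern is covered by q-grams taken at
    offsets 0, q, 2q, ... plus one final q-gram aligned with its end, so a
    position survives the intersections only if the whole pattern matches.
    """
    m = len(pattern)
    if m < q:
        raise ValueError("pattern must be at least q characters long")

    offsets = list(range(q, m - q + 1, q))
    if m > q and (m - q) not in offsets:
        offsets.append(m - q)  # cover the tail of the pattern

    # Start from the postings of the first q-gram, then intersect with the
    # postings of each later q-gram shifted back by its offset.
    candidates = set(index.get(pattern[:q], ()))
    for off in offsets:
        gram = pattern[off:off + q]
        candidates &= {p - off for p in index.get(gram, ())}
        if not candidates:
            break
    return sorted(candidates)


if __name__ == "__main__":
    text = "acgtacgtac"
    idx = build_qgram_index(text, 3)
    print(locate(idx, 3, "acgta"))
    print(locate(idx, 3, "gtac"))
```

Because every character of the pattern falls inside at least one of the chosen q-grams, the intersection reports exact occurrences without rescanning the text. The cost of each query depends on the lengths of the posting lists involved.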