Efficient construction of FM-index using overlapping block processing for large scale texts

Authors:
Di Zhang; Yunquan Zhang; Jing Chen
Affiliations:
Institute of Software, Chinese Academy of Sciences; Institute of Software, Chinese Academy of Sciences and State Key Laboratory of Computer Science; Microsoft Research Asia
Venue:
ECIR'07 Proceedings of the 29th European conference on IR research
Year:
2007

Abstract

In previous implementations of the FM-index, the construction algorithms usually need several times more memory than the size of the text. This memory requirement can prevent the FM-index from being used to process large-scale texts. In this paper, we design an approach to constructing the FM-index based on overlapping block processing. It builds the FM-index in linear time using a constant amount of temporary memory, which makes it especially suitable for large-scale texts. Instead of loading and indexing the text as a whole, the new approach splits the text into blocks of fixed size and indexes each block separately. To ensure the correctness and effectiveness of query operations, before indexing we also append a certain number of the succeeding characters to the end of each block. The experimental results show that, at the cost of a slight loss in compression ratio and query performance, our implementation provides a faster and more flexible solution to the problem of construction efficiency.
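
The overlapping-block idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `split_overlapping` and `locate`, the parameters `block_size` and `overlap`, and the use of `str.find` as a stand-in for an FM-index query on each block are all assumptions made for illustration. The sketch only shows why appending succeeding characters to each block lets a pattern that straddles a block boundary still be found, provided the pattern is at most `overlap + 1` characters long.

```python
def split_overlapping(text, block_size, overlap):
    """Yield (start, block) pairs.

    Each block holds `block_size` characters plus up to `overlap`
    succeeding characters copied from the following block(s).
    (Hypothetical helper; parameter names are assumptions.)
    """
    if block_size <= 0 or overlap < 0:
        raise ValueError("block_size must be positive and overlap non-negative")
    for start in range(0, len(text), block_size):
        yield start, text[start:start + block_size + overlap]


def locate(text, pattern, block_size, overlap):
    """Return all start positions of `pattern`, searching block by block.

    str.find stands in for a per-block index query; a real system
    would query an FM-index built over each block instead.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    if len(pattern) - 1 > overlap:
        # A longer pattern could cross a boundary without fitting in any block.
        raise ValueError("pattern must be at most overlap + 1 characters long")
    hits = []
    for start, block in split_overlapping(text, block_size, overlap):
        pos = block.find(pattern)
        # Only report matches that begin inside the block proper; matches
        # that begin inside the appended overlap belong to the next block.
        while pos != -1 and pos < block_size:
            hits.append(start + pos)
            pos = block.find(pattern, pos + 1)
    return hits


if __name__ == "__main__":
    print(locate("abracadabra", "abra", block_size=4, overlap=3))
```

With `block_size=4` and `overlap=3`, the second occurrence of `abra` (which spans the boundary between the second and third blocks) is found inside the second block's appended characters, and no occurrence is reported twice.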