I/O-Efficient Compressed Text Indexes: From Theory to Practice

  • Authors:
  • Sheng-Yuan Chiu; Wing-Kai Hon; Rahul Shah; Jeffrey Scott Vitter

  • Venue:
  • DCC '10 Proceedings of the 2010 Data Compression Conference
  • Year:
  • 2010

Abstract

Pattern matching on text data has been a fundamental field of Computer Science for nearly 40 years. Databases supporting full-text indexing functionality on text data are now widely used by biologists. In the theoretical literature, the most popular internal-memory index structures are the suffix trees and the suffix arrays, and the most popular external-memory index structure is the String B-tree. However, the practical applicability of these indexes has been limited mainly because of their space consumption and I/O issues. These structures use a lot more space (almost 20 to 50 times more) than the original text data and are often disk-resident. Ferragina and Manzini (2005) and Grossi and Vitter (2005) gave the first compressed text indexes with efficient query times in the internal-memory model. Recently, Chien et al. (2008) presented a compact text index in the external memory based on the concept of Geometric Burrows-Wheeler Transform. They also presented lower bounds which suggested that it may be hard to obtain a good index structure in the external memory. In this paper, we investigate this issue from a practical point of view. On the positive side, we show an external-memory text indexing structure (based on R-trees and KD-trees) that saves space by about an order of magnitude as compared to the standard String B-tree. While saving space, these structures also maintain a comparable I/O efficiency to that of the String B-tree. We also show various space vs. I/O efficiency trade-offs for our structures.