Pattern matching on text data has been a fundamental field of Computer Science for nearly 40 years. Databases supporting full-text indexing functionality on text data are now widely used by biologists. In the theoretical literature, the most popular internal-memory index structures are suffix trees and suffix arrays, and the most popular external-memory index structure is the String B-tree. However, the practical applicability of these indexes has been limited, mainly because of their space consumption and I/O issues. These structures use far more space (almost 20 to 50 times more) than the original text data and are often disk-resident. Ferragina and Manzini (2005) and Grossi and Vitter (2005) gave the first compressed text indexes with efficient query times in the internal-memory model. Recently, Chien et al. (2008) presented a compact text index in external memory based on the concept of the Geometric Burrows-Wheeler Transform. They also presented lower bounds which suggested that it may be hard to obtain a good index structure in external memory. In this paper, we investigate this issue from a practical point of view. On the positive side, we show an external-memory text indexing structure (based on R-trees and KD-trees) that saves space by about an order of magnitude compared to the standard String B-tree. While saving space, these structures also maintain I/O efficiency comparable to that of the String B-tree. We also show various space vs. I/O efficiency trade-offs for our structures.