Software—Practice & Experience
A very fast substring search algorithm. Communications of the ACM
Fast and flexible word searching on compressed text. ACM Transactions on Information Systems (TOIS)
A fast string searching algorithm. Communications of the ACM
Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences
An efficient compression code for text databases. ECIR'03 Proceedings of the 25th European conference on IR research
A universal algorithm for sequential data compression. IEEE Transactions on Information Theory
Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory
In this paper we present the adaptation of a compression technique, originally designed to compress large textual databases, to the peculiarities of web search engines. The (s,c)-Dense Code belongs to a new category of compression techniques [Silva de Moura, E., G. Navarro, N. Ziviani and R. Baeza-Yates, Fast and flexible word searching on compressed text, ACM Transactions on Information Systems 18 (2000), pp. 113-139; Brisaboa, N., A. Farina, G. Navarro and M. Esteller, (s,c)-dense coding: An optimized compression code for natural language text databases, in: Proc. 10th International Symposium on String Processing and Information Retrieval (SPIRE 2003), LNCS 2857, 2003, pp. 122-136] that allow fast and flexible search directly on compressed files. However, these methods are only suitable for large natural-language texts of at least 1 megabyte; on smaller texts they do not achieve an attractive amount of compression. Their main appeal is search speed: they allow searches on compressed files up to eight times faster than searching on the plain versions [Silva de Moura, E., G. Navarro, N. Ziviani and R. Baeza-Yates, Fast and flexible word searching on compressed text, ACM Transactions on Information Systems 18 (2000), pp. 113-139]. To take advantage of these search capabilities, we present a modification of the basic (s,c)-Dense Code that achieves reasonable compression ratios on small files, a requirement when working with search engines.
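
For readers unfamiliar with the basic (s,c)-Dense Code, the following is a minimal sketch of how a word's frequency rank could be mapped to a byte codeword. It follows the general scheme from the cited (s,c)-dense coding paper as we understand it: byte values below s act as "stoppers" that end a codeword, and the remaining c = 256 - s values act as "continuers". The function names and the choice s = 190 are illustrative assumptions, and the sketch does not show the small-file modification proposed in this paper.

```python
def encode_rank(rank: int, s: int, c: int) -> bytes:
    """Encode a 0-based frequency rank as an (s,c)-dense codeword.

    Assumption: stoppers are byte values 0..s-1 and continuers are
    s..s+c-1, with s + c == 256. Names are hypothetical.
    """
    assert rank >= 0 and s > 0 and c > 0 and s + c == 256
    out = [rank % s]              # final byte is always a stopper
    x = rank // s
    while x > 0:                  # remaining digits become continuers
        x -= 1
        out.append(s + x % c)
        x //= c
    return bytes(reversed(out))


def decode_codeword(code: bytes, s: int, c: int) -> int:
    """Invert encode_rank: recover the rank from one codeword."""
    x = 0
    for b in code[:-1]:           # leading bytes are continuers
        x = x * c + (b - s) + 1
    return x * s + code[-1]       # last byte is the stopper


if __name__ == "__main__":
    s, c = 190, 66                # arbitrary illustrative split
    for rank in (0, 189, 190, 12729, 12730):
        code = encode_rank(rank, s, c)
        print(rank, list(code), decode_codeword(code, s, c))
```

Under these assumptions, the s most frequent words get one-byte codewords, the next s*c words get two bytes, and so on. Because only the last byte of a codeword is below s, codeword boundaries can be found by inspecting byte values alone, which is what makes searching the compressed file directly practical.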