An efficient compression code for text databases

Authors:
Nieves R. Brisaboa; Eva L. Iglesias; Gonzalo Navarro; José R. Paramá
Affiliations:
Database Lab., Univ. da Coruña, Facultade de Informática, A Coruña, Spain; Computer Science Dept., Univ. de Vigo, Escola Superior de Enxeñería Informática, Ourense, Spain; Dept. of Computer Science, Univ. de Chile, Santiago, Chile; Database Lab., Univ. da Coruña, Facultade de Informática, A Coruña, Spain
Venue:
ECIR'03 Proceedings of the 25th European conference on IR research
Year:
2003

References (9):
Word-based text compression. Software—Practice & Experience
Text compression
Fast and flexible word searching on compressed text. ACM Transactions on Information Systems (TOIS)
Modern Information Retrieval
Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences
Adding Compression to Block Addressing Inverted Indexes. Information Retrieval
On Lower Bounds for the Redundancy of Optimal Codes. Designs, Codes and Cryptography
Compression: A Key for Next-Generation Text Retrieval Systems. Computer
On the implementation of minimum-redundancy prefix codes. DCC '96 Proceedings of the Conference on Data Compression

Cited by (19):
Efficiently decodable and searchable natural language adaptive compression. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
A general compression algorithm that supports fast searching. Information Processing Letters
New technique for data compression. SEPADS'05 Proceedings of the 4th WSEAS International Conference on Software Engineering, Parallel & Distributed Systems
Improved Variable-to-Fixed Length Codes. SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
Rank and Select for Succinct Data Structures. Electronic Notes in Theoretical Computer Science (ENTCS)
Simple Random Access Compression. Fundamenta Informaticae
Fast and Flexible Compression for Web Search Engines. Electronic Notes in Theoretical Computer Science (ENTCS)
The strategy design of compression and transmission on cGML spatial data and its application in LBS. WiCOM'09 Proceedings of the 5th International Conference on Wireless communications, networking and mobile computing
Improving semistatic compression via pair-based coding. PSI'06 Proceedings of the 6th international Andrei Ershov memorial conference on Perspectives of systems informatics
Simple compression code supporting random access and fast string matching. WEA'07 Proceedings of the 6th international conference on Experimental algorithms
Training parse trees for efficient VF coding. SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
On improving Tunstall codes. Information Processing and Management: an International Journal
Phrase-Based pattern matching in compressed text. SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval
Efficient compression of text attributes of data warehouse dimensions. DaWaK'05 Proceedings of the 7th international conference on Data Warehousing and Knowledge Discovery
Compressing dynamic text collections via phrase-based coding. ECDL'05 Proceedings of the 9th European conference on Research and Advanced Technology for Digital Libraries
Enhanced byte codes with restricted prefix properties. SPIRE'05 Proceedings of the 12th international conference on String Processing and Information Retrieval
Simple Random Access Compression. Fundamenta Informaticae
ODC: Frame for definition of Dense codes. European Journal of Combinatorics
Practical fixed length Lempel-Ziv coding. Discrete Applied Mathematics

Abstract

We present a new compression format for natural language texts that allows both exact and approximate search without decompression. This new code, called End-Tagged Dense Code, has several advantages over other compression techniques with similar features, such as the Tagged Huffman Code of [Moura et al., ACM TOIS 2000]. Our compression method obtains (i) better compression ratios, (ii) a simpler vocabulary representation, and (iii) simpler and faster encoding. At the same time, it retains the most interesting features of the method based on the Tagged Huffman Code: exact search for words and phrases directly on the compressed text using any known sequential pattern matching algorithm, efficient word-based approximate and extended searches without any decoding, and efficient decompression of arbitrary portions of the text. As a side effect, our analytical results give new upper and lower bounds for the redundancy of d-ary Huffman codes.
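The abstract does not spell out the code itself, so the following is only a minimal sketch, assuming the usual byte-oriented description of End-Tagged Dense Code: words are ranked by decreasing frequency, every codeword is a sequence of bytes, and only the last byte of a codeword has its high bit set (value 128 or more). The function names (`etdc_encode`, `etdc_decode`, `compress`, `decompress`) and the tie-breaking rule used when ranking words are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """Map a 0-based frequency rank to its codeword (assumed byte-oriented ETDC)."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    out = [128 + rank % 128]      # last byte: high bit set marks the codeword end
    rank //= 128
    while rank > 0:
        rank -= 1                 # dense numbering: no byte combination is wasted
        out.append(rank % 128)    # continuation bytes stay below 128
        rank //= 128
    return bytes(reversed(out))


def etdc_decode(code: bytes) -> int:
    """Inverse of etdc_encode for a single complete codeword."""
    rank = 0
    for b in code[:-1]:
        rank = (rank + b + 1) * 128
    return rank + code[-1] - 128


def compress(words):
    """Rank words by decreasing frequency (ties broken alphabetically, an
    arbitrary choice made here) and concatenate their codewords."""
    counts = Counter(words)
    vocab = sorted(counts, key=lambda w: (-counts[w], w))
    rank = {w: i for i, w in enumerate(vocab)}
    return vocab, b"".join(etdc_encode(rank[w]) for w in words)


def decompress(vocab, data: bytes):
    """Split the byte stream at tagged bytes and look each rank up in vocab."""
    words, start = [], 0
    for pos, b in enumerate(data):
        if b >= 128:              # end of a codeword
            words.append(vocab[etdc_decode(data[start:pos + 1])])
            start = pos + 1
    return words
```

Under these assumptions the codeword depends only on a word's rank, so the vocabulary can be stored as a plain frequency-sorted list, which is one way to read the "simpler vocabulary representation" claim. Because the tagged byte marks where each codeword ends, decoding can also start right after any byte with value 128 or more, which illustrates decompression of arbitrary portions of the text.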