Lightweight natural language text compression

Authors:
Nieves R. Brisaboa; Antonio Fariña; Gonzalo Navarro; José R. Paramá
Affiliations:
Database Lab., Univ. da Coruña, Facultade de Informática, A Coruña, Spain 15071; Database Lab., Univ. da Coruña, Facultade de Informática, A Coruña, Spain 15071; Center for Web Research, Dept. of Computer Science, Univ. de Chile, Blanco Encalada, Santiago, Chile 2120; Database Lab., Univ. da Coruña, Facultade de Informática, A Coruña, Spain 15071
Venue:
Information Retrieval
Year:
2007

Abstract

Variants of Huffman codes that take words as the source symbols are currently the most attractive choices for compressing natural language text databases. In particular, Tagged Huffman Code by Moura et al. offers fast direct searching on the compressed text and random access capabilities, in exchange for producing compressed files around 11% larger. This work describes End-Tagged Dense Code and (s, c)-Dense Code, two new semistatic statistical methods for compressing natural language texts. These techniques permit simpler and faster encoding and obtain better compression ratios than Tagged Huffman Code, while maintaining its fast direct search and random access capabilities. We show that Dense Codes improve the compression ratio of Tagged Huffman Code by about 10%, reaching only 0.6% overhead over the optimal Huffman compression ratio. Being simpler, Dense Codes are generated 45% to 60% faster than Huffman codes. This makes Dense Codes a very attractive alternative to Huffman code variants: they are simpler to program, faster to build, of almost optimal size, and as fast and easy to search as the best Huffman variants, which are not as close to the optimal size.
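
The abstract names End-Tagged Dense Code without giving its byte layout, so the following is only a minimal sketch of how such a word-based, semistatic code might look. It assumes the usual byte-oriented formulation: words are ranked by decreasing frequency, the last byte of every codeword has its high bit set (values 128 to 255), and all earlier bytes have it clear (values 0 to 127). The function names (`etdc_encode_rank`, `etdc_decode_stream`, `build_vocabulary`, `compress`, `decompress`) and the whitespace tokenization are illustrative choices, not taken from the paper.

```python
from collections import Counter

STOPPERS = 128  # assumed layout: byte values 128..255 mark the end of a codeword


def etdc_encode_rank(rank):
    """Return the codeword (bytes) for the word with 0-based frequency rank."""
    out = [STOPPERS + rank % 128]  # last byte carries the end tag (high bit set)
    rank //= 128
    while rank > 0:
        rank -= 1                  # dense step: every byte combination is used
        out.append(rank % 128)     # earlier bytes keep the high bit clear
        rank //= 128
    return bytes(reversed(out))


def etdc_decode_stream(data):
    """Split a byte stream at end-tagged bytes and return the list of ranks."""
    ranks, acc = [], 0
    for b in data:
        if b < STOPPERS:
            acc = acc * 128 + b + 1          # continuation byte
        else:
            ranks.append(acc * 128 + (b - STOPPERS))  # tagged byte ends the codeword
            acc = 0
    return ranks


def build_vocabulary(words):
    """First pass of a semistatic model: words sorted by decreasing frequency."""
    counts = Counter(words)
    return sorted(counts, key=lambda w: (-counts[w], w))


def compress(words):
    """Second pass: replace each word by the codeword of its rank."""
    vocab = build_vocabulary(words)
    rank_of = {w: r for r, w in enumerate(vocab)}
    payload = b"".join(etdc_encode_rank(rank_of[w]) for w in words)
    return vocab, payload


def decompress(vocab, payload):
    return [vocab[r] for r in etdc_decode_stream(payload)]


if __name__ == "__main__":
    text = "to be or not to be that is the question to be".split()
    vocab, payload = compress(text)
    assert decompress(vocab, payload) == text
    print(len(payload), "bytes for", len(text), "words")
```

Under these assumptions a codeword depends only on the word's rank, not on its exact frequency, which is one way to read the abstract's claim that Dense Codes are simpler and faster to build than Huffman codes: sorting the vocabulary is enough, and no code tree is needed. The tagged last byte also lets a decoder or a searcher find codeword boundaries from any position in the compressed stream.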