A Corpus for the Evaluation of Lossless Compression Algorithms

Authors:
Ross Arnold; Tim Bell
Venue:
DCC '97 Proceedings of the Conference on Data Compression
Year:
1997

Citing 0
Cited 37

Universal Data Compression Based on the Burrows-Wheeler Transformation: Theory and Practice
IEEE Transactions on Computers

Searching Digital Music Libraries
ICADL '02 Proceedings of the 5th International Conference on Asian Digital Libraries: Digital Libraries: People, Knowledge, and Technology

Compact Suffix Array
CPM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching

String Matching with Stopper Encoding and Code Splitting
CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching

A Dynamic Data Structure for Reverse Lexicographically Sorted Prefixes
CPM '99 Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching

FPGA-Based Modelling Unit for High Speed Lossless Arithmetic Coding
FPL '01 Proceedings of the 11th International Conference on Field-Programmable Logic and Applications

On the Performance of BWT Sorting Algorithms
DCC '00 Proceedings of the Conference on Data Compression

Space-Time Tradeoffs in the Inverse B-W Transform
DCC '01 Proceedings of the Data Compression Conference

Design and Implementation of a Lossless Parallel High-Speed Data Compression System
IEEE Transactions on Parallel and Distributed Systems

Searching digital music libraries
Information Processing and Management: an International Journal - Special issue: An Asian digital libraries perspective

Alternative source coding model for mobile text communication
Proceedings of the 2005 ACM symposium on Applied computing

A Configurable Statistical Lossless Compression Core Based on Variable Order Markov Modeling and Arithmetic Coding
IEEE Transactions on Computers

An analysis of XML compression efficiency
Proceedings of the 2007 workshop on Experimental computer science

An analysis of XML binary formats and compression
ecs'07 Experimental computer science on Experimental computer science

Efficient Algorithms for the Inverse Sort Transform
IEEE Transactions on Computers

Evolutionary lossless compression with GP-ZIP*
Proceedings of the 10th annual conference on Genetic and evolutionary computation

Compression of small text files
Advanced Engineering Informatics

TinyLex: static n-gram index pruning with perfect recall
Proceedings of the 17th ACM conference on Information and knowledge management

Stateful hardware decompression in networking environment
Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems

Hash Functions Based on Large Quasigroups
ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I

An Application of Self-organizing Data Structures to Compression
SEA '09 Proceedings of the 8th International Symposium on Experimental Algorithms

On prediction using variable order Markov models
Journal of Artificial Intelligence Research

Dynamic Edit Distance Table under a General Weighted Cost Function
SOFSEM '10 Proceedings of the 36th Conference on Current Trends in Theory and Practice of Computer Science

PPM with the extended alphabet
Information Sciences: an International Journal

Post BWT stages of the Burrows–Wheeler compression algorithm
Software—Practice & Experience

A compact representation of nondeterministic (suffix) automata for the bit-parallel approach
CPM'10 Proceedings of the 21st annual conference on Combinatorial pattern matching

Evolution of human-competitive lossless compression algorithms with GP-zip2
Genetic Programming and Evolvable Machines

Mapping words into codewords on PPM
SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval

Suffix tree based data compression
SOFSEM'05 Proceedings of the 31st international conference on Theory and Practice of Computer Science

Searching for smallest grammars on large sequences and application to DNA
Journal of Discrete Algorithms

Suppressing redundancy in wireless sensor network traffic
DCOSS'10 Proceedings of the 6th IEEE international conference on Distributed Computing in Sensor Systems

Choosing word occurrences for the smallest grammar problem
LATA'10 Proceedings of the 4th international conference on Language and Automata Theory and Applications

A fast and efficient nearly-optimal adaptive Fano coding scheme
Information Sciences: an International Journal

Improving evolved alphabet using tabu set
HAIS'12 Proceedings of the 7th international conference on Hybrid Artificial Intelligent Systems - Volume Part I

Efficient computation of substring equivalence classes with suffix arrays
CPM'07 Proceedings of the 18th annual conference on Combinatorial Pattern Matching

An effective heuristic for the smallest grammar problem
Proceedings of the 15th annual conference on Genetic and evolutionary computation

Adaptive Online Compression in Clouds--Making Informed Decisions in Virtual Machine Environments
Journal of Grid Computing

Quantified Score

Hi-index	0.01

Abstract

A number of authors have used the Calgary corpus of texts to provide empirical results for lossless compression algorithms. This corpus was collected in 1987, although it was not published until 1990. Advances in compression algorithms have been achieving relatively small improvements in compression, as measured using the Calgary corpus. There is a concern that algorithms are being fine-tuned to this corpus, and that small improvements measured in this way may not apply to other files. Furthermore, the corpus is almost ten years old, and over this period the kinds of files that are compressed have changed, particularly with the development of the Internet and the rapid growth of high-capacity secondary storage for personal computers. We explore these issues and develop a principled technique for collecting a corpus of test data for compression methods. A corpus, called the Canterbury corpus, is developed using this technique, and we report the performance of a collection of compression methods on the new corpus.