Evaluating text preprocessing to improve compression on maillogs

Authors:
Fred Otten;Barry Irwin;Hannah Thinyane
Affiliations:
Rhodes University, Grahamstown, South Africa;Rhodes University, Grahamstown, South Africa;Rhodes University, Grahamstown, South Africa
Venue:
Proceedings of the 2009 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists
Year:
2009

Abstract

Maillogs contain important information about mail that has been sent or received. This information can be used for statistical purposes and to help prevent viruses and spam. To satisfy regulations and follow good security practice, maillogs need to be monitored and archived. Because the quantity of data is large, some form of data reduction is necessary, and data compression programs such as gzip and bzip2 are commonly used for this purpose. Text preprocessing is known to aid the compression of English text files. This paper evaluates whether text preprocessing, particularly word replacement, can also improve the compression of maillogs. It presents an algorithm for constructing a dictionary for word replacement and reports the results of experiments conducted with the ppmd, gzip, bzip2 and 7zip programs. These tests show that text preprocessing improves data compression on maillogs: improvements of up to 56 percent in compression time and up to 32 percent in compression ratio are achieved. The results also show that a dictionary generated from one maillog may be used on other maillogs to yield reductions within half a percent of those achieved for the maillog used to generate the dictionary.
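
The abstract does not describe the dictionary construction algorithm, so the following is only a minimal Python sketch of the general idea of word replacement, not the authors' method. It assumes whitespace-delimited tokens, a reserved escape character that never occurs in the log, and a simple "bytes saved" ranking; the names `build_dictionary`, `encode`, `decode` and `compressed_sizes`, and the parameters `max_entries` and `min_length`, are hypothetical. Python's `zlib` (DEFLATE) and `bz2` modules stand in for the gzip and bzip2 programs.

```python
import bz2
import re
import zlib
from collections import Counter

ESCAPE = "\x01"               # assumption: this character never occurs in a maillog
FIRST_CODE = 0x21             # codes use the printable ASCII characters '!' .. '~'
MAX_CODES = 0x7E - FIRST_CODE + 1
TOKEN = re.compile(r"\S+")    # assumption: a "word" is any run of non-whitespace


def build_dictionary(text, max_entries=90, min_length=4):
    """Pick repeated tokens whose replacement by a 2-character code saves the most."""
    if max_entries > MAX_CODES:
        raise ValueError("too many entries for single-character codes")
    counts = Counter(t for t in TOKEN.findall(text) if len(t) >= min_length)
    repeated = [(word, n) for word, n in counts.items() if n > 1]
    # Rank by estimated characters saved: (token length - code length) * occurrences.
    repeated.sort(key=lambda item: (len(item[0]) - 2) * item[1], reverse=True)
    return [word for word, _ in repeated[:max_entries]]


def encode(text, dictionary):
    """Replace every dictionary token with ESCAPE plus a one-character index."""
    if ESCAPE in text:
        raise ValueError("input already contains the escape character")
    codes = {w: ESCAPE + chr(FIRST_CODE + i) for i, w in enumerate(dictionary)}
    return TOKEN.sub(lambda m: codes.get(m.group(0), m.group(0)), text)


def decode(text, dictionary):
    """Invert encode(): expand each ESCAPE-prefixed code back to its token."""
    pattern = re.compile(re.escape(ESCAPE) + r"(.)", re.DOTALL)
    return pattern.sub(lambda m: dictionary[ord(m.group(1)) - FIRST_CODE], text)


def compressed_sizes(text):
    """Compressed size in bytes under DEFLATE (zlib) and bzip2, at level 9."""
    data = text.encode("utf-8")
    return {"zlib": len(zlib.compress(data, 9)), "bz2": len(bz2.compress(data, 9))}
```

To try it, build a dictionary from one log, call `encode` on that log or on a different one, and compare `compressed_sizes` before and after. Whether the preprocessed text compresses better depends on the log and the compressor, so the sketch makes no claim about reproducing the figures reported above.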