Experiments with Persian text compression for web

Authors:
Farhad Oroumchian; Ehsan Darrudi; Fattane Taghiyareh; Neeyaz Angoshtari
Affiliations:
University of Wollongong in Dubai, Dubai, UAE; University of Tehran, Tehran, Iran; University of Tehran, Tehran, Iran; University of Southern California, Los Angeles, CA
Venue:
Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters
Year:
2004

Abstract

The increasing importance of Unicode for text encoding implies a possible doubling of data storage space and data transmission time, with a corresponding need for data compression. The approach presented in this paper aims to reduce the storage space and transmission time of Persian text files in web-based applications and on the Internet. The basic idea is to find the most frequently repeated n-grams in Persian text and replace each of them with a single character from the user-defined sections of Unicode. Compression is done once, on the server side, and no separate decompression step is needed: the browser's rendering process performs the decompression implicitly. No additional program or add-in has to be installed on the browser or client side; the user only needs to download the proper Unicode font once. A genetic algorithm is used to select the most appropriate n-grams. In the best case, we achieved a 52.26% reduction in file size. The method is general and applies equally well to English and other languages.
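
The substitution idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the "user-defined sections" correspond to the Unicode Private Use Area starting at U+E000, and it picks n-grams with a simple greedy frequency ranking as a stand-in for the genetic algorithm. All function names (top_ngrams, build_table, compress, expand, utf16_size) are hypothetical. The expand function only exists to check the round trip; in the scheme described above, a font whose private-use glyphs draw the original n-grams would play that role in the browser.

```python
from collections import Counter

# Assumption: the "user-defined sections" are taken to be the Private Use
# Area of the Basic Multilingual Plane, which starts at U+E000.
PUA_START = 0xE000


def top_ngrams(text, n_values=(2, 3), k=4):
    """Return k n-grams ranked by a rough estimate of characters saved.

    Greedy stand-in for the genetic algorithm mentioned in the abstract.
    Overlapping occurrences are counted, so the estimate is optimistic.
    """
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Replacing one occurrence of an n-gram saves (n - 1) characters.
    ranked = sorted(counts, key=lambda g: (-(counts[g] * (len(g) - 1)), g))
    return ranked[:k]


def build_table(ngrams):
    """Map each selected n-gram to one private-use character."""
    return {gram: chr(PUA_START + i) for i, gram in enumerate(ngrams)}


def compress(text, table):
    """Server-side step: substitute n-grams, longest first.

    Assumes the input text contains no private-use characters itself.
    """
    for gram in sorted(table, key=len, reverse=True):
        text = text.replace(gram, table[gram])
    return text


def expand(text, table):
    """Inverse mapping, used here only to verify the round trip."""
    inverse = {code: gram for gram, code in table.items()}
    return "".join(inverse.get(ch, ch) for ch in text)


def utf16_size(text):
    """Size in bytes when stored as UTF-16 without a byte order mark."""
    return len(text.encode("utf-16-le"))


if __name__ == "__main__":
    sample = "این متن یک متن نمونه است و این متن تکراری است"
    table = build_table(top_ngrams(sample))
    packed = compress(sample, table)
    assert expand(packed, table) == sample
    print(utf16_size(sample), "->", utf16_size(packed), "bytes")
```

Sizes are measured in UTF-16 here because both Persian letters and private-use characters in this range occupy two bytes each in that encoding. In UTF-8 a character at U+E000 or above takes three bytes while a Persian letter takes two, so the saving would differ.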