Experiments with Persian text compression for web

  • Authors:
  • Farhad Oroumchian;Ehsan Darrudi;Fattane Taghiyareh;Neeyaz Angoshtari

  • Affiliations:
  • University of Wollongong in Dubai, Dubai, UAE;University of Tehran, Tehran, IRAN;University of Tehran, Tehran, IRAN;University of Southern California, Los Angeles, CA

  • Venue:
  • Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters
  • Year:
  • 2004

Abstract

The increasing importance of Unicode for text encoding implies a possible doubling of data storage space and data transmission time, with a corresponding need for data compression. The approach presented in this paper aims to reduce the storage space and transmission time of Persian text files in web-based applications and on the Internet. The basic idea is to find the most frequently repeated n-grams in Persian text and replace each of them with a single character from the user-defined sections of Unicode. Compression is done once, on the server side, and no separate decompression step is required: the rendering process in the browser performs the decompression. No additional program or add-in for decompression needs to be installed in the browser or on the client side; the user only needs to download the appropriate Unicode font once. A genetic algorithm is used to select the most appropriate n-grams. In the best case, we achieved a 52.26% reduction in file size. The method is general and applies equally well to English and other languages.
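
The substitution idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper selects n-grams with a genetic algorithm, whereas this sketch swaps in a simple greedy ranking by characters saved. It also assumes that the "user-defined sections" correspond to the Unicode Private Use Area (U+E000 to U+F8FF). The function names (top_ngrams, build_table, compress, expand) and the parameters n_values and limit are hypothetical. The expand function only simulates what a suitable font would do at rendering time, where the glyph for each private-use character draws the whole n-gram.

```python
from collections import Counter

# Assumption: the "user-defined sections" are the Private Use Area of the BMP.
PUA_START = 0xE000
PUA_END = 0xF8FF


def top_ngrams(text, n_values=(2, 3, 4), limit=16):
    """Pick up to `limit` repeated n-grams (greedy stand-in for the paper's GA)."""
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Replacing an n-gram seen c times by one character saves about c * (n - 1)
    # characters; overlapping occurrences are counted, so this is only an estimate.
    ranked = sorted(counts.items(),
                    key=lambda kv: (-(kv[1] * (len(kv[0]) - 1)), kv[0]))
    return [gram for gram, c in ranked if c > 1][:limit]


def build_table(ngrams):
    """Map each chosen n-gram to one private-use character."""
    if len(ngrams) > PUA_END - PUA_START + 1:
        raise ValueError("too many n-grams for the private use area")
    return {gram: chr(PUA_START + i) for i, gram in enumerate(ngrams)}


def compress(text, table):
    """Scan left to right, replacing the longest table entry found at each position."""
    if any(PUA_START <= ord(ch) <= PUA_END for ch in text):
        raise ValueError("input already contains private-use characters")
    lengths = sorted({len(g) for g in table}, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for n in lengths:
            gram = text[i:i + n]
            if gram in table:
                out.append(table[gram])
                i += len(gram)
                break
        else:
            # No table entry starts here, so copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def expand(compressed, table):
    """Simulate font rendering: each private-use character shows its n-gram."""
    reverse = {code: gram for gram, code in table.items()}
    return "".join(reverse.get(ch, ch) for ch in compressed)


if __name__ == "__main__":
    sample = "سلام سلام سلام"
    table = build_table(top_ngrams(sample))
    packed = compress(sample, table)
    print(len(sample), len(packed), expand(packed, table) == sample)
```

The saving in bytes depends on the encoding. Persian letters and Private Use Area characters both take 2 bytes in UTF-16, but in UTF-8 a Persian letter takes 2 bytes and a character in U+E000 to U+F8FF takes 3, so under UTF-8 a bigram replacement only pays off when the bigram covers more than 3 bytes.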