Natural Language Compression on Edge-Guided Text Preprocessing

  • Authors:
  • Miguel A. Martínez-Prieto; Joaquín Adiego; Pablo de la Fuente

  • Affiliations:
  • Department of Computer Science, University of Valladolid, E.T.S. de Ingeniería Informática, Campus Miguel Delibes, 47011 Valladolid, Spain (all authors); M. A. Martínez-Prieto also with Department of Computer Science, University ...

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 2011


Abstract

This paper presents Edge-Guided (E-G), an optimized text preprocessing technique for compression purposes. It transforms the original text into a word net, which stores all relationships between adjoining words. A specific directed graph is proposed to model this transformation: words are stored in vertices, whereas edges represent word transitions. Thus, the word net has a text representation which reflects the natural word order in the text, so it can be used directly for encoding purposes. A specific coding scheme is described on top of the word net. It regards a text as a sequence of word transitions, in such a way that each word is encoded by traversing a specific edge from the vertex which stores the preceding word. This accomplishes a 1-order text preprocessing whose output is an intermediate byte representation that can be effectively encoded with universal techniques. This technique, called E-G1, supports several variants. This experience is then used to revisit the concept of the word net and to identify significant 2-word symbols by performing a specific transformation on frequent edges. The resulting transformed word net appends these 2-word symbols to the original word vocabulary and establishes a hierarchical relationship between them and their component words. The transformed approach also enhances the original coding scheme to handle these new features. The new technique, called E-G2, approximates a 2-order model on words and likewise supports specific variants. Both techniques are studied from empirical and experimental perspectives. Several compressors are also used to analyze the preprocessing ability of E-G with respect to different compression approaches. Competitive space/time trade-offs are achieved when our approaches are used to compress medium- to large-sized texts. The best results are achieved when E-G preprocessing is coupled with high-order compressors such as Prediction by Partial Matching (PPM).
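
To make the abstract's description concrete, the sketch below shows, in Python, one way a word net and a 1-order edge encoding could be realized: vertices are words, edges count transitions between adjoining words, and each word is emitted as the rank of the edge leaving the preceding word's vertex. It is only an illustrative approximation of the E-G1 idea, not the authors' implementation; the names build_word_net and encode_first_order, the frequency-rank coding, and the integer (rather than byte-oriented) output are assumptions made for this example.

    # Illustrative sketch (not the paper's implementation) of a word net and
    # a 1-order encoding: each word is replaced by the rank of the edge that
    # leads to it from the preceding word's vertex.
    from collections import defaultdict

    def build_word_net(words):
        """Directed graph: vertices are words, edge weights count transitions
        between adjoining words."""
        edges = defaultdict(lambda: defaultdict(int))
        for prev, curr in zip(words, words[1:]):
            edges[prev][curr] += 1
        return edges

    def encode_first_order(words, edges):
        """Encode each word (after the first) as the frequency rank of the
        edge (prev -> word) among the edges leaving prev; rank 0 is the most
        frequent transition."""
        codes = []
        for prev, curr in zip(words, words[1:]):
            ranked = sorted(edges[prev], key=lambda w: -edges[prev][w])
            codes.append(ranked.index(curr))
        return codes

    if __name__ == "__main__":
        text = "the cat sat on the mat and the cat ran".split()
        net = build_word_net(text)
        print(encode_first_order(text, net))  # e.g. [0, 0, 0, 0, 1, 0, 0, 1]

Because frequent transitions receive small ranks, the resulting sequence is highly skewed, which is the property that lets the intermediate representation be compressed effectively by the universal back-end compressors mentioned in the abstract.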