On the Use of Word Alignments to Enhance Bitext Compression

  • Authors:
  • Miguel A. Martínez-Prieto; Joaquín Adiego; Felipe Sánchez-Martínez; Pablo de la Fuente; Rafael C. Carrasco

  • Venue:
  • DCC '09 Proceedings of the 2009 Data Compression Conference
  • Year:
  • 2009

Abstract

This paper describes a novel approach to compressing bilingual parallel corpora (bitexts). The approach exploits the fact that the two texts forming a bitext are mutual translations. First, the two texts are aligned at both the sentence and the word level. Then, the word alignments are used to define biwords, that is, pairs of words, one from each text, that are mutual translations. Finally, a biword-based PPM compressor is applied. Compressing the two texts of the bitext together in this way yields better compression ratios than compressing each text independently with a word-based PPM compressor, thus saving storage and transmission costs.
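
To make the notion of a biword concrete, the following is a minimal Python sketch of how a sentence pair and its word alignment could be turned into a biword sequence. It is an illustration only and not the authors' implementation: the function name `build_biwords`, the `EMPTY` placeholder for unaligned words, and the restriction to one-to-one alignments are all assumptions introduced here. The sketch also ignores target word order; a real compressor would have to encode enough positional information to restore the target text exactly.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder meaning "no aligned word on this side".
EMPTY = ""

Biword = Tuple[str, str]


def build_biwords(
    source: List[str],
    target: List[str],
    alignment: Dict[int, int],
) -> List[Biword]:
    """Pair each source word with its aligned target word.

    `alignment` maps a source index to a target index and is assumed
    to be one-to-one (a simplification made for this sketch).
    """
    biwords: List[Biword] = []

    # Walk the source sentence left to right, pairing aligned words.
    for i, src_word in enumerate(source):
        j = alignment.get(i)
        tgt_word = target[j] if j is not None else EMPTY
        biwords.append((src_word, tgt_word))

    # Emit target words that no source word is aligned to.
    aligned_targets = set(alignment.values())
    for j, tgt_word in enumerate(target):
        if j not in aligned_targets:
            biwords.append((EMPTY, tgt_word))

    return biwords


if __name__ == "__main__":
    src = ["la", "casa", "verde"]
    tgt = ["the", "green", "house"]
    print(build_biwords(src, tgt, {0: 0, 1: 2, 2: 1}))
```

The resulting sequence of pairs is what a biword-based model would treat as its symbol stream, in place of the two separate word streams that a word-based compressor would see.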