Modelling Parallel Texts for Boosting Compression

  • Authors:
  • Joaquín Adiego;Miguel A. Martínez-Prieto;Javier E. Hoyos-Torío;Felipe Sánchez-Martínez

  • Venue:
  • DCC '10 Proceedings of the 2010 Data Compression Conference
  • Year:
  • 2010

Abstract

Bilingual parallel corpora, also known as bitexts, convey the same information in two different languages. This implies that, when modelling bitexts, one can take advantage of the relation that exists between the two texts; the text alignment task makes it possible to establish this relationship. In this paper we propose different approaches that use words and biwords (pairs of words, each one from a different text) as symbolic representation units. The properties of these approaches are analyzed from a statistical point of view and tested as a preprocessing step for general-purpose compressors. The results obtained suggest interesting conclusions concerning the use of both words and biwords. When the encoded models are used as compression boosters, we achieve compression ratios that improve on state-of-the-art compressors by up to 6.5 percentage points, while being up to 40% faster.
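
As a rough illustration of the biword idea, the sketch below pairs aligned words from two texts, maps each pair to an integer identifier, and hands the identifier stream to a general-purpose compressor. It is not the method proposed in the paper: the alignment format, the function names (`build_biwords`, `encode`, `compress_ids`, `decompress_ids`), the two-byte identifier encoding, and the choice of `bz2` are all assumptions made for this example. The sketch also does not record the original order of the right-hand text, so it is not a complete lossless bitext codec.

```python
import bz2
from typing import List, Optional, Tuple

# Hypothetical alignment format: alignment[i] is the index of the right-text
# word aligned with left word i, or None when left word i is unaligned.
Alignment = List[Optional[int]]
Biword = Tuple[str, str]


def build_biwords(left: List[str], right: List[str],
                  alignment: Alignment) -> List[Biword]:
    """Pair each left word with its aligned right word.

    Unaligned words are paired with the empty string so that every
    symbol has the same (left, right) shape.
    """
    used = set()
    biwords: List[Biword] = []
    for i, word in enumerate(left):
        j = alignment[i]
        if j is None:
            biwords.append((word, ""))
        else:
            biwords.append((word, right[j]))
            used.add(j)
    # Right-text words that no left word points to are appended at the end.
    for j, word in enumerate(right):
        if j not in used:
            biwords.append(("", word))
    return biwords


def encode(symbols: List[Biword]) -> Tuple[List[Biword], List[int]]:
    """Assign identifiers in order of first occurrence."""
    vocab = {}
    ids = []
    for symbol in symbols:
        if symbol not in vocab:
            vocab[symbol] = len(vocab)
        ids.append(vocab[symbol])
    # Dicts preserve insertion order, so position in the list equals the id.
    return list(vocab), ids


def compress_ids(ids: List[int]) -> bytes:
    """Serialize ids as 2-byte integers (assumes fewer than 65536 symbols)."""
    raw = b"".join(i.to_bytes(2, "big") for i in ids)
    return bz2.compress(raw)


def decompress_ids(blob: bytes) -> List[int]:
    """Invert compress_ids."""
    raw = bz2.decompress(blob)
    return [int.from_bytes(raw[k:k + 2], "big") for k in range(0, len(raw), 2)]


left = ["the", "green", "house"]
right = ["la", "casa", "verde"]
alignment = [0, 2, 1]  # the->la, green->verde, house->casa

vocab, ids = encode(build_biwords(left, right, alignment))
restored = [vocab[i] for i in decompress_ids(compress_ids(ids))]
```

Here `restored` holds the same biword sequence that was encoded. A full scheme would also have to store the vocabulary and enough positional information to rebuild both texts exactly.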