A Two-Level Structure for Compressing Aligned Bitexts

Authors:
Joaquín Adiego;Nieves R. Brisaboa;Miguel A. Martínez-Prieto;Felipe Sánchez-Martínez
Affiliations:
Dept. de Informática, Universidad de Valladolid, Spain;Database Lab, Universidade da Coruña, Spain;Dept. de Informática, Universidad de Valladolid, Spain;Dept. de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, Spain
Venue:
SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Year:
2009

Citing (10 works):
Compression of parallel texts. Information Processing and Management: an International Journal - Special issue on data compression for images and texts
A fast string searching algorithm. Communications of the ACM
Information Retrieval: Computational and Theoretical Aspects
Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences
A systematic comparison of various statistical alignment models. Computational Linguistics
PPM: One Step to Practicality. DCC '02 Proceedings of the Data Compression Conference
Parallel texts. Natural Language Engineering
Lightweight natural language text compression. Information Retrieval
Compressed full-text indexes. ACM Computing Surveys (CSUR)
On the Use of Word Alignments to Enhance Bitext Compression. DCC '09 Proceedings of the 2009 Data Compression Conference

Cited by (3 works):
Improved alignment based algorithm for multilingual text compression. LATA'11 Proceedings of the 5th international conference on Language and automata theory and applications
Generalized biwords for bitext compression and translation spotting. Journal of Artificial Intelligence Research
Generalized biwords for bitext compression and translation spotting: extended abstract. IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence


Abstract

A bitext, or bilingual parallel corpus, consists of two texts, each in a different language, that are mutual translations. Bitexts are very useful in linguistic engineering because they serve as a source of knowledge for different purposes. In this paper we propose a strategy to efficiently compress and use bitexts, saving not only space but also processing time when exploiting them. Our strategy is based on a two-level structure for the vocabularies and on the use of biwords (pairs of associated words, one from each language) as the basic symbols to be encoded with an ETDC [2] compressor. The resulting compressed bitext needs around 20% of the original space and allows more efficient implementations of the different types of searches and operations that linguistic engineers need to perform on bitexts. We discuss and provide results for compression, decompression, different types of searches, and bilingual snippet extraction.
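
To make the idea of encoding biwords with ETDC more concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not the authors' implementation: it uses a single flat biword vocabulary rather than the two-level structure the paper proposes, it takes the word alignment as given, and the function names (`etdc_encode`, `etdc_decode`, `compress_biwords`, `decompress_biwords`) and the toy data are hypothetical. It relies only on the standard End-Tagged Dense Code scheme, in which symbols are ranked by frequency and the last byte of every codeword has its high bit set.

```python
from collections import Counter


def etdc_encode(rank):
    """Return the ETDC codeword (bytes) for a 0-based frequency rank."""
    out = [128 + rank % 128]       # last byte: high bit set marks the codeword end
    rank = rank // 128 - 1
    while rank >= 0:
        out.append(rank % 128)     # earlier bytes: high bit clear
        rank = rank // 128 - 1
    return bytes(reversed(out))


def etdc_decode(data):
    """Decode a concatenation of ETDC codewords back into a list of ranks."""
    ranks, value = [], -1
    for byte in data:
        if byte < 128:             # continuation byte
            value = (value + 1) * 128 + byte
        else:                      # end-tagged byte closes the codeword
            ranks.append((value + 1) * 128 + byte - 128)
            value = -1
    return ranks


def compress_biwords(biwords):
    """Rank biwords by frequency (assumed flat vocabulary) and encode them."""
    counts = Counter(biwords)
    vocab = sorted(counts, key=lambda bw: (-counts[bw], bw))
    rank = {bw: i for i, bw in enumerate(vocab)}
    payload = b"".join(etdc_encode(rank[bw]) for bw in biwords)
    return vocab, payload


def decompress_biwords(vocab, payload):
    """Recover the original biword sequence from the vocabulary and payload."""
    return [vocab[r] for r in etdc_decode(payload)]


# Toy, hand-aligned example (hypothetical data): (source word, target word) pairs.
biwords = [("the", "la"), ("house", "casa"), ("the", "la"), ("green", "verde")]
vocab, payload = compress_biwords(biwords)
restored = decompress_biwords(vocab, payload)
```

Because every codeword ends in a byte with the high bit set, codeword boundaries can be recognised anywhere in the compressed stream. This is the property that lets a biword's codeword be searched for directly in the compressed data without decompressing it first.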