This paper describes a novel approach to compressing bilingual parallel corpora (bitexts). The approach exploits the fact that the two texts forming a bitext are mutual translations. First, the two texts are aligned at both the sentence and the word level. Then, the word alignments are used to define biwords, that is, pairs of words, one from each text, that are mutual translations. Finally, a biword-based PPM compressor is applied. Compressing the two texts of the bitext together yields better compression ratios than compressing each text independently with a word-based PPM compressor, thereby reducing storage and transmission costs.
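The biword-formation step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `make_biwords`, the one-to-one alignment format (pairs of word indices), and the use of an empty string for unaligned words are all assumptions made here for illustration. The abstract does not say how unaligned or multi-word alignments are handled, and a real compressor would also have to encode enough positional information to restore the word order of the second text.

```python
from typing import List, Tuple

Biword = Tuple[str, str]


def make_biwords(
    left: List[str],
    right: List[str],
    alignment: List[Tuple[int, int]],
) -> List[Biword]:
    """Pair each left-text word with its aligned right-text word.

    Assumptions (hypothetical, not taken from the paper):
    - `alignment` holds one-to-one (left_index, right_index) pairs;
    - an unaligned word is paired with the empty string.
    """
    left_to_right = dict(alignment)
    aligned_right = set(left_to_right.values())

    biwords: List[Biword] = []
    # Walk the left sentence in order, attaching the aligned right word.
    for i, word in enumerate(left):
        j = left_to_right.get(i)
        biwords.append((word, right[j] if j is not None else ""))

    # Right-text words with no partner are emitted as half-empty biwords.
    for j, word in enumerate(right):
        if j not in aligned_right:
            biwords.append(("", word))
    return biwords


# One aligned sentence pair (English / Spanish) with a word reordering.
example = make_biwords(
    ["the", "green", "house"],
    ["la", "casa", "verde"],
    [(0, 0), (1, 2), (2, 1)],
)
print(example)  # [('the', 'la'), ('green', 'verde'), ('house', 'casa')]
```

The resulting sequence of biwords would then serve as the symbol stream for the PPM model, in place of the two separate word streams. Because a word and its translation tend to co-occur, each biword is treated as a single symbol, which is the redundancy the approach aims to exploit.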