Information Processing and Management: an International Journal - Special issue on data compression for images and texts
A fast string searching algorithm
Communications of the ACM
Information Retrieval: Computational and Theoretical Aspects
Information Retrieval: Computational and Theoretical Aspects
Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences
A systematic comparison of various statistical alignment models
Computational Linguistics
DCC '02 Proceedings of the Data Compression Conference
Natural Language Engineering
Lightweight natural language text compression
Information Retrieval
ACM Computing Surveys (CSUR)
On the Use of Word Alignments to Enhance Bitext Compression
DCC '09 Proceedings of the 2009 Data Compression Conference
Improved alignment based algorithm for multilingual text compression
LATA'11 Proceedings of the 5th international conference on Language and automata theory and applications
Generalized biwords for bitext compression and translation spotting
Journal of Artificial Intelligence Research
Generalized biwords for bitext compression and translation spotting: extended abstract
IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
A bitext, or bilingual parallel corpus, consists of two texts, each in a different language, that are mutual translations. Bitexts are very useful in linguistic engineering because they serve as a source of knowledge for many purposes. In this paper we propose a strategy to compress and use bitexts efficiently, saving not only space but also processing time when exploiting them. Our strategy is based on a two-level structure for the vocabularies and on the use of biwords, pairs of associated words, one from each language, as the basic symbols to be encoded with an ETDC [2] compressor. The resulting compressed bitext needs around 20% of the original space and allows more efficient implementations of the different types of searches and operations that linguistic engineering applications need to perform on it. We discuss and provide results for compression, decompression, different types of searches, and bilingual snippet extraction.
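To make the idea of encoding biwords with ETDC (End-Tagged Dense Code) more concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes the word alignment has already been computed and yields one (source word, target word) pair per position, and it uses a single flat vocabulary of pairs rather than the two-level vocabulary structure the abstract mentions. The function names and the toy English/Spanish data are hypothetical. Each biword is ranked by frequency and its rank is written as a byte sequence in which only the last byte has its high bit set, so codeword boundaries can be found without a separate index.

```python
from collections import Counter


def etdc_encode_rank(rank: int) -> bytes:
    """Return the End-Tagged Dense Code byte sequence for a 0-based rank."""
    # The last byte carries the end tag: its high bit is set (values 128..255).
    out = [128 + rank % 128]
    rank = rank // 128 - 1
    # Remaining bytes are continuation bytes with the high bit clear (0..127).
    while rank >= 0:
        out.append(rank % 128)
        rank = rank // 128 - 1
    return bytes(reversed(out))


def etdc_decode_ranks(payload: bytes) -> list:
    """Decode a concatenation of ETDC codewords back into ranks."""
    ranks = []
    acc = 0
    for byte in payload:
        if byte >= 128:  # end-tagged byte closes the current codeword
            ranks.append(acc * 128 + (byte - 128))
            acc = 0
        else:  # continuation byte
            acc = acc * 128 + byte + 1
    return ranks


def compress_biwords(biwords):
    """Rank biwords by descending frequency and encode each occurrence.

    Assumption: `biwords` is a list of (source_word, target_word) tuples
    produced by some earlier alignment step that is not shown here.
    """
    counts = Counter(biwords)
    # Ties are broken by the pair itself so the vocabulary order is deterministic.
    vocabulary = sorted(counts, key=lambda bw: (-counts[bw], bw))
    rank_of = {bw: rank for rank, bw in enumerate(vocabulary)}
    payload = b"".join(etdc_encode_rank(rank_of[bw]) for bw in biwords)
    return vocabulary, payload


def decompress_biwords(vocabulary, payload):
    """Recover the original biword sequence from vocabulary and payload."""
    return [vocabulary[rank] for rank in etdc_decode_ranks(payload)]


# Toy aligned bitext (hypothetical data, for illustration only).
toy_biwords = [
    ("the", "la"), ("house", "casa"), ("the", "la"), ("green", "verde"),
]
toy_vocabulary, toy_payload = compress_biwords(toy_biwords)
```

In this sketch the most frequent biwords receive one-byte codewords, and a search for a given biword can scan the payload for its codeword directly, using the end-tagged byte to confirm that a match starts on a codeword boundary. How the paper handles unaligned words, multiword alignments, and the two-level vocabulary is not covered here.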