Generalized biwords for bitext compression and translation spotting

Authors:
Felipe Sánchez-Martínez;Rafael C. Carrasco;Miguel A. Martínez-Prieto;Joaquín Adiego
Affiliations:
Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, Alacant, Spain;Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, Alacant, Spain;Departamento de Informática, Universidad de Valladolid, Valladolid, Spain;Departamento de Informática, Universidad de Valladolid, Valladolid, Spain
Venue:
Journal of Artificial Intelligence Research
Year:
2012

Abstract

Large bilingual parallel texts (also known as bitexts) are usually stored in compressed form, and previous work has shown that they can be compressed more efficiently if the fact that the two texts are mutual translations is exploited. For example, a bitext can be seen as a sequence of biwords (pairs of parallel words with a high probability of co-occurrence) that can be used as an intermediate representation in the compression process. However, the simple biword approach described in the literature can only exploit one-to-one word alignments and cannot handle the reordering of words. We therefore introduce a generalization of biwords that can describe multi-word expressions and reorderings. We also describe several methods for the binary compression of generalized biword sequences and compare their performance when different schemes are applied to the extraction of the biword sequence. In addition, we show that this generalization of biwords allows for the implementation of an efficient algorithm that searches the compressed bitext for words or text segments in one of the texts and retrieves their translations in the other text, an application usually referred to as translation spotting, with only minor modifications to the compression algorithm.
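To make the idea of a biword that carries several target words and their reordering more concrete, the following is a minimal sketch in Python. It is not the representation or algorithm defined in the paper: the `Biword` type, the use of absolute target positions, the function names, and the toy English-Spanish sentence pair are all assumptions chosen for illustration. It only shows that such a sequence can be decoded back into both texts and scanned for translations of a source word; the binary compression step is not covered.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Biword(NamedTuple):
    """Hypothetical generalized biword: one source word, its target words,
    and the absolute position of each target word in the target sentence."""
    source: str                 # "" marks target words with no source counterpart
    targets: Tuple[str, ...]    # zero or more target words
    positions: Tuple[int, ...]  # one target position per target word


def to_biwords(src: List[str], tgt: List[str],
               alignment: Iterable[Tuple[int, int]]) -> List[Biword]:
    """Build a biword sequence from word alignments given as (i, j) pairs,
    assuming each target index j is aligned to at most one source index i."""
    pairs = sorted(alignment)
    by_source: Dict[int, List[int]] = {}
    for i, j in pairs:
        by_source.setdefault(i, []).append(j)
    aligned_targets = {j for _, j in pairs}

    biwords = []
    for i, word in enumerate(src):
        js = by_source.get(i, [])
        biwords.append(Biword(word, tuple(tgt[j] for j in js), tuple(js)))
    # Target words left unaligned are kept as biwords with an empty source.
    for j, word in enumerate(tgt):
        if j not in aligned_targets:
            biwords.append(Biword("", (word,), (j,)))
    return biwords


def from_biwords(biwords: List[Biword]) -> Tuple[List[str], List[str]]:
    """Recover both texts, which shows that the representation is lossless."""
    src = [b.source for b in biwords if b.source]
    placed = {p: w for b in biwords for w, p in zip(b.targets, b.positions)}
    tgt = [placed[p] for p in sorted(placed)]
    return src, tgt


def spot(biwords: List[Biword], query: str) -> List[Tuple[str, ...]]:
    """Naive translation spotting: target words paired with a source word."""
    return [b.targets for b in biwords if b.source == query]


# Toy example with a reordering (adjective and noun swap places).
src = "the green house".split()
tgt = "la casa verde".split()
alignment = {(0, 0), (1, 2), (2, 1)}

biwords = to_biwords(src, tgt, alignment)
restored = from_biwords(biwords)
hits = spot(biwords, "green")  # [("verde",)]
```

Storing absolute positions keeps the sketch short. A compressor would more plausibly store small relative offsets, since small integers that repeat often are cheaper to encode, but that choice is outside what this sketch demonstrates.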