Substring-based machine translation

Authors:
Graham Neubig;Taro Watanabe;Shinsuke Mori;Tatsuya Kawahara
Affiliations:
Nara Institute of Science and Technology, Ikoma, Japan;National Institute of Information and Communications Technology, Kyoto, Japan;Kyoto University, Kyoto, Japan;Kyoto University, Kyoto, Japan
Venue:
Machine Translation
Year:
2013

Citing 44
Cited 0

A systematic comparison of various statistical alignment models

Computational Linguistics
The mathematics of statistical machine translation: parameter estimation

Computational Linguistics - Special issue on using large corpora: II
Stochastic inversion transduction grammars and bilingual parsing of parallel corpora

Computational Linguistics
Machine transliteration

Computational Linguistics
Replacing suffix trees with enhanced suffix arrays

Journal of Discrete Algorithms - SPIRE 2002
Improving SMT quality with morpho-syntactic analysis

COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
HMM-based word alignment in statistical translation

COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
BLEU: a method for automatic evaluation of machine translation

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Translating named entities using monolingual and bilingual resources

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
A parsing: fast exact Viterbi parse selection

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Statistical phrase-based translation

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Cognates can improve statistical translation models

NAACL-Short '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003--short papers - Volume 2
Minimum error rate training in statistical machine translation

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
A phrase-based, joint probability model for statistical machine translation

EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
A joint source-channel model for machine transliteration

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Stochastic lexicalized inversion transduction grammar for alignment

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Modelling lexical redundancy for machine translation

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Improving statistical MT through morphological analysis

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Alignment by agreement

HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics
Moses: open source toolkit for statistical machine translation

ACL '07 Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions
Sub-sentential alignment using substring co-occurrence counts

COLING ACL '06 Proceedings of the 21st International Conference on computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Sampling alignment structure under a Bayesian translation model

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Morphological analysis for statistical machine translation

HLT-NAACL-Short '04 Proceedings of HLT-NAACL 2004: Short Papers
Can we translate letters?

StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
English-to-Czech factored machine translation

StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
Improved statistical machine translation by multiple Chinese word segmentation

StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Optimizing Chinese word segmentation for machine translation performance

StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
A Gibbs sampler for phrasal synchronous grammar induction

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2
Better word alignments with supervised ITG models

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2
Learning stochastic bracketing inversion transduction grammars with a cubic time biparsing algorithm

IWPT '09 Proceedings of the 11th International Conference on Parsing Technologies
Unsupervised tokenization for machine translation

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
Inducing synchronous grammars with slice sampling

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Nonparametric word segmentation for machine translation

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
An exponential translation model for target language morphology

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
An unsupervised model for joint phrase alignment and extraction

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Unsupervised bilingual morpheme segmentation and alignment with context-rich hidden semi-Markov models

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Language-independent compound splitting with morphological operations

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Automatic evaluation of Chinese translation output: word-level or character-level?

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems

WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation
Character-based pivot translation for under-resourced languages and domains

EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Machine translation without words through substring alignment

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Character-level machine translation evaluation for languages with ambiguous word boundaries

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Combining word-level and character-level models for machine translation between closely-related languages

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2
A bayesian model for learning SCFGs with discontiguous rules

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Quantified Score

Hi-index	0.00

Visualization

Abstract

Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words to help cope with data-sparsity issues that occur when simply dividing words according to white spaces. In this paper, we take a different approach: not dividing lexical processing and translation into two steps, but simply viewing translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracies on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed in Neubig et al. (Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon, pp. 632---641, 2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves comparable results to word-based translation for two distant language pairs.