Source language markers in EUROPARL translations

Authors:
Hans van Halteren
Affiliations:
Radboud University Nijmegen, Nijmegen, The Netherlands
Venue:
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Year:
2008

Citing 1
Cited 8

Linguistic profiling of texts for the purpose of language verification

COLING '04 Proceedings of the 20th international conference on Computational Linguistics

Translationese and its dialects

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Language models for machine translation: original vs. translated texts

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Exploiting parse structures for native language identification

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Adapting translation models to translationese improves SMT

EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Exploring adaptor grammars for native language identification

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Language models for machine translation: Original vs. translated texts

Computational Linguistics
Improving statistical machine translation by adapting translation models to translationese

Computational Linguistics
Improving statistical machine translation by adapting translation models to translationese

Computational Linguistics

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper shows that it is very often possible to identify the source language of medium-length speeches in the EUROPARL corpus on the basis of frequency counts of word n-grams (87.2%--96.7% accuracy depending on classification method). The paper also examines in detail which positive markers are most powerful and identifies a number of linguistic aspects as well as culture- and domain-related ones.