Searching for poor quality machine translated text: learning the difference between human writing and machine translations

  • Authors:
  • Dave Carter; Diana Inkpen

  • Affiliations:
  • Dave Carter: School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Ontario, Canada; Institute for Information Technology, National Research Council Canada
  • Diana Inkpen: School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Ontario, Canada

  • Venue:
  • Canadian AI'12: Proceedings of the 25th Canadian Conference on Advances in Artificial Intelligence
  • Year:
  • 2012

Abstract

As machine translation (MT) tools have become mainstream, machine translated text has increasingly appeared on multilingual websites. Trustworthy multilingual websites are used as training corpora for statistical machine translation tools, and large amounts of MT text in that training data may make such tools less effective. We performed three experiments to determine whether a support vector machine (SVM) could distinguish machine translated text from human written text (both original text and human translations). Machine translated versions of the Canadian Hansard were detected with an F-measure of 0.999. Machine translated versions of six Government of Canada websites were detected with an F-measure of 0.98. We validated these results with a decision tree classifier. An experiment to find MT text on Government of Ontario websites using Government of Canada training data was unsuccessful, producing a high rate of false positives. Machine translated text thus appears to be learnable and detectable when a similar training corpus is available.
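The abstract does not specify the features or classifier configuration the authors used. As a minimal sketch of the general approach, the following assumes scikit-learn, a linear-kernel SVM, TF-IDF word n-gram features (an illustrative choice, not necessarily the paper's), and a hypothetical labelled corpus of sentences.

```python
# Sketch: training an SVM to separate machine translated text from
# human written text. Feature choice (TF-IDF word n-grams) and the
# placeholder data below are assumptions for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Hypothetical labelled data: 1 = machine translated, 0 = human written.
# In practice, replace with real sentences (e.g., MT output of a corpus
# paired with its human written counterpart).
texts = ["example machine translated sentence ...",
         "example human written sentence ..."] * 50
labels = [1, 0] * 50

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

# TF-IDF over word unigrams and bigrams feeding a linear-kernel SVM.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LinearSVC())
clf.fit(X_train, y_train)

# F-measure (F1), the metric reported in the abstract.
print("F-measure:", f1_score(y_test, clf.predict(X_test)))
```

The paper's finding that cross-domain detection failed (Government of Canada training data applied to Government of Ontario pages) suggests a classifier like this should be evaluated on held-out data from the same domain as its training corpus.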