Searching for poor quality machine translated text: learning the difference between human writing and machine translations

  • Authors:
  • Dave Carter; Diana Inkpen

  • Affiliations:
  • Dave Carter: School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Ontario, Canada; Institute for Information Technology, National Research Council Canada
  • Diana Inkpen: School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Ontario, Canada

  • Venue:
  • Canadian AI'12: Proceedings of the 25th Canadian Conference on Advances in Artificial Intelligence
  • Year:
  • 2012

Abstract

As machine translation (MT) tools have become mainstream, machine translated text has increasingly appeared on multilingual websites. Trustworthy multilingual websites are used as training corpora for statistical machine translation tools, and large amounts of MT text in that training data may make such tools less effective. We performed three experiments to determine whether a support vector machine (SVM) could distinguish machine translated text from human written text (both original text and human translations). Machine translated versions of the Canadian Hansard were detected with an F-measure of 0.999. Machine translated versions of six Government of Canada websites were detected with an F-measure of 0.98. We validated these results with a decision tree classifier. An experiment to find MT text on Government of Ontario websites using Government of Canada training data was unsuccessful, producing a high rate of false positives. Machine translated text thus appears to be learnable and detectable when a similar training corpus is available.
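The abstract does not specify the features or classifier configuration the authors used. As a minimal sketch of the general approach, the following assumes scikit-learn, a linear-kernel SVM, TF-IDF word n-gram features (an illustrative choice, not necessarily the paper's), and a hypothetical labelled corpus of sentences.

```python
# Sketch: training an SVM to separate machine translated text from
# human written text. Feature choice (TF-IDF word n-grams) and the
# placeholder data below are assumptions for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Hypothetical labelled data: 1 = machine translated, 0 = human written.
# In practice, replace with real sentences (e.g., MT output of a corpus
# paired with its human written counterpart).
texts = ["example machine translated sentence ...",
         "example human written sentence ..."] * 50
labels = [1, 0] * 50

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

# TF-IDF over word unigrams and bigrams feeding a linear-kernel SVM.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LinearSVC())
clf.fit(X_train, y_train)

# F-measure (F1), the metric reported in the abstract.
print("F-measure:", f1_score(y_test, clf.predict(X_test)))
```

The paper's finding that cross-domain detection failed (Government of Canada training data applied to Government of Ontario pages) suggests a classifier like this should be evaluated on held-out data from the same domain as its training corpus.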