Correction of errors in a verb modality corpus for machine translation with a machine-learning method

Authors:
Masaki Murata;Masao Utiyama;Kiyotaka Uchimoto;Hitoshi Isahara;Qing Ma
Affiliations:
National Institute of Information and Communications Technology;National Institute of Information and Communications Technology;National Institute of Information and Communications Technology;National Institute of Information and Communications Technology;Ryukoku University, and National Institute of Information and Communications Technology
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2005

Abstract

In recent years, various types of tagged corpora have been constructed and much research using tagged corpora has been done. However, tagged corpora contain errors, which impedes the progress of research. Therefore, the correction of errors in corpora is an important research issue. In this study we investigate the correction of such errors, which we call corpus correction. Using machine-learning methods, we applied corpus correction to a verb modality corpus for machine translation. We used the maximum-entropy and decision-list methods as machine-learning methods. We compared several kinds of methods for corpus correction in our experiments, and determined which is most effective by using a statistical test. We obtained several noteworthy findings: (1) Precision was almost the same for both detection and correction, so it is more convenient to do both correction and detection, rather than detection only. (2) In general, the maximum-entropy method worked better than the decision-list method; but the two methods had almost the same precision for the top 50 pieces of extracted data when closed data was used. (3) In terms of precision, the use of closed data was better than the use of open data; however, in terms of the total number of extracted errors, the use of open data was better than the use of closed data. Based on our analysis of these results, we developed a good method for corpus correction. We confirmed the effectiveness of our method by carrying out experiments on machine translation. As corpus-based machine translation continues to be developed, the corpus correction we discuss in this article should prove to be increasingly significant.
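The corpus-correction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It uses a simplified decision-list-style classifier (the single most confident feature rule decides the tag) and flags items whose corpus tag disagrees with the predicted tag, ranked by rule confidence, so that each flagged item comes with a proposed correction. The function names, the `suffix=...` features, the toy corpus, and the smoothing constant `alpha` are all hypothetical. The sketch also assumes that "closed" means the checked item is part of the training data and "open" means it is held out; the abstract does not define these terms.

```python
from collections import Counter, defaultdict


def train_rules(data, alpha=0.1):
    """Estimate smoothed P(tag | feature) from (features, tag) pairs."""
    tags = sorted({tag for _, tag in data})
    counts = defaultdict(Counter)
    for features, tag in data:
        for f in features:
            counts[f][tag] += 1
    rules = {}
    for f, c in counts.items():
        total = sum(c.values()) + alpha * len(tags)
        rules[f] = {t: (c[t] + alpha) / total for t in tags}
    return rules


def predict(rules, features):
    """Decision-list style: let the most confident matching rule decide."""
    best_tag, best_p = None, 0.0
    for f in features:
        for t, p in rules.get(f, {}).items():
            if p > best_p:
                best_tag, best_p = t, p
    return best_tag, best_p


def correction_candidates(corpus, mode="closed", folds=2):
    """Return (confidence, index, corpus_tag, suggested_tag), best first.

    mode="closed": rules are trained on the whole corpus (assumed meaning).
    mode="open":   each item is checked with rules trained on other folds.
    """
    candidates = []
    for k in range(folds if mode == "open" else 1):
        if mode == "open":
            train = [x for i, x in enumerate(corpus) if i % folds != k]
            test = [(i, x) for i, x in enumerate(corpus) if i % folds == k]
        else:
            train = corpus
            test = list(enumerate(corpus))
        rules = train_rules(train)
        for i, (features, tag) in test:
            guess, p = predict(rules, features)
            # Detection and correction in one step: a disagreement is a
            # suspected error, and the prediction is the proposed fix.
            if guess is not None and guess != tag:
                candidates.append((p, i, tag, guess))
    return sorted(candidates, reverse=True)


# Hypothetical toy corpus; item 4 is deliberately mislabeled.
toy_corpus = [
    (("suffix=ta",), "past"),
    (("suffix=ru",), "present"),
    (("suffix=ta",), "past"),
    (("suffix=ru",), "present"),
    (("suffix=ta",), "present"),
    (("suffix=ru",), "present"),
    (("suffix=ta",), "past"),
    (("suffix=ru",), "present"),
    (("suffix=ta",), "past"),
    (("suffix=ta",), "past"),
]

closed = correction_candidates(toy_corpus, mode="closed")
opened = correction_candidates(toy_corpus, mode="open", folds=2)
```

In practice a human annotator would inspect only the top-ranked candidates, which is where a comparison such as "precision for the top 50 pieces of extracted data" becomes relevant. On this toy corpus both modes flag the same item; on real data the two settings can rank and extract different items.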