Building a training corpus for word sense disambiguation in English-to-Vietnamese machine translation

Authors:
Dien Dinh
Affiliations:
VNU-HCMC, Vietnam
Venue:
COLING-MTIA '02 Proceedings of the 2002 COLING workshop on Machine translation in Asia - Volume 16
Year:
2002

Abstract

The most difficult task in machine translation is the elimination of ambiguity in human languages. A given word in English, as in Vietnamese, often has several meanings that depend on its syntactic position in the sentence and on the actual context. Earlier approaches resolved this ambiguity with many hand-coded rules. However, building these rules manually is a time-consuming and exhausting task. We therefore propose an automatic method that solves this problem by using a semantically tagged corpus. In this paper, we mainly present the building of a semantically tagged bilingual corpus for word sense disambiguation (WSD) in English texts. To assign semantic tags, we take advantage of bilingual texts via word alignments, using the semantic class names of LLOCE (Longman Lexicon of Contemporary English). So far, we have built a 5,000,000-word bilingual corpus in which 1,000,000 words have been semantically annotated with an accuracy of 70%. We evaluated the semantic tagging by comparing it with SEMCOR on the SUSANNE part of our corpus. This semantically annotated corpus will be used to extract disambiguation rules automatically with the TBL (Transformation-based Learning) method. These rules will be manually revised before being applied to the WSD module in the English-to-Vietnamese Translation (EVT) system.
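
The idea of assigning semantic tags through word alignments can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that each English word and each Vietnamese word has a set of candidate semantic classes, and that an aligned pair is tagged with the class the two sides share. The lexicons, the class labels (placeholders, not real LLOCE codes), and the function name `tag_by_alignment` are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy lexicons. The class labels are placeholders,
# not actual LLOCE semantic class names.
EN_CLASSES: Dict[str, Set[str]] = {
    "bank": {"FINANCE", "GEOGRAPHY"},
    "river": {"GEOGRAPHY"},
}
VI_CLASSES: Dict[str, Set[str]] = {
    "ngân hàng": {"FINANCE"},
    "bờ": {"GEOGRAPHY"},
    "sông": {"GEOGRAPHY"},
}


def tag_by_alignment(
    aligned_pairs: List[Tuple[str, str]],
) -> List[Tuple[str, Optional[str]]]:
    """Tag each English word with the class it shares with its aligned word.

    Assumption: a tag is assigned only when exactly one class is shared;
    otherwise the word is left untagged (None).
    """
    tagged: List[Tuple[str, Optional[str]]] = []
    for en_word, vi_word in aligned_pairs:
        shared = EN_CLASSES.get(en_word, set()) & VI_CLASSES.get(vi_word, set())
        tag = next(iter(shared)) if len(shared) == 1 else None
        tagged.append((en_word, tag))
    return tagged


if __name__ == "__main__":
    pairs = [("bank", "ngân hàng"), ("bank", "bờ"), ("river", "sông")]
    for word, tag in tag_by_alignment(pairs):
        print(word, tag)
```

In this sketch the ambiguous English word "bank" receives a different class depending on which Vietnamese word it is aligned with, which is how a translation can act as a disambiguating signal. Leaving a word untagged when the shared set is empty or ambiguous is a design choice of the sketch, not something stated in the abstract.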