An empirical evaluation of stop word removal in statistical machine translation

Authors:
Chong Tze Yuang;Rafael E. Banchs;Chng Eng Siong
Affiliations:
Nanyang Technological University, Singapore;Institute for Infocomm Research, A*STAR, Singapore;Nanyang Technological University, Singapore
Venue:
EACL 2012 Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)
Year:
2012

Abstract

In this paper we evaluate whether the performance of a statistical machine translation system can be improved by relaxing the complexity of the translation task: the most frequent and predictable terms are removed from the target language vocabulary, and the removed terms are afterwards inserted back into the relaxed output by an n-gram based word predictor. Empirically, we have found that the perplexity of the text decreases when these words are omitted, which may imply a reduction of confusion in the text. We conducted machine translation experiments to see whether this perplexity reduction produces better translation output. While the word predictor exhibits 77% accuracy in predicting 40% of the most frequent words in the text, the perplexity reduction did not help to produce better translations.
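
The remove-then-reinsert idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the predictor beyond "n-gram based", so the code below assumes a very simple stand-in that looks only at the two content words surrounding each gap and restores the stop word sequence most often seen between them in training data. All function names (`top_frequent`, `strip_stop_words`, `gap_fillers`, `train`, `restore`) and the `<s>` / `</s>` boundary markers are hypothetical.

```python
from collections import Counter, defaultdict


def top_frequent(sentences, k):
    """Return the k most frequent tokens; these play the role of stop words."""
    counts = Counter(w for s in sentences for w in s)
    return {w for w, _ in counts.most_common(k)}


def strip_stop_words(sentence, stop_words):
    """Produce the 'relaxed' sentence with all stop words removed."""
    return [w for w in sentence if w not in stop_words]


def gap_fillers(sentence, stop_words):
    """Yield (left, right, filler) for every gap between kept words.

    filler is the (possibly empty) tuple of stop words that stood between
    the kept words left and right. "<s>" and "</s>" mark sentence boundaries.
    """
    left, filler = "<s>", []
    for w in sentence + ["</s>"]:
        if w in stop_words:
            filler.append(w)
        else:
            yield left, w, tuple(filler)
            left, filler = w, []


def train(sentences, stop_words):
    """Count which fillers occur between each pair of neighbouring kept words."""
    table = defaultdict(Counter)
    for s in sentences:
        for left, right, filler in gap_fillers(s, stop_words):
            table[(left, right)][filler] += 1
    return table


def restore(relaxed, table):
    """Reinsert the most frequent filler for each gap; unseen gaps stay empty."""
    out, left = [], "<s>"
    for right in relaxed + ["</s>"]:
        seen = table.get((left, right))
        if seen:
            out.extend(seen.most_common(1)[0][0])
        if right != "</s>":
            out.append(right)
        left = right
    return out


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
    ]
    stop_words = {"the", "on"}  # chosen by hand for this toy example
    table = train(corpus, stop_words)
    relaxed = strip_stop_words("the cat sat on the rug".split(), stop_words)
    print(relaxed)
    print(restore(relaxed, table))
```

In a real setting the stop list would come from corpus frequencies (for example `top_frequent(corpus, k)` on a large corpus), and the predictor would be a properly smoothed n-gram model rather than a lookup table; the sketch only shows where the removal and reinsertion steps sit relative to each other.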