Word segmentation for dialect translation

Authors:
Michael Paul;Andrew Finch;Eiichiro Sumita
Affiliations:
National Institute of Information and Communications Technology, Kyoto, Japan;National Institute of Information and Communications Technology, Kyoto, Japan;National Institute of Information and Communications Technology, Kyoto, Japan
Venue:
CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part II
Year:
2011

Abstract

This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous source language text. The goal is to improve the quality of statistical machine translation (SMT) for local dialects by exploiting linguistic information from the standard language. The method iteratively learns multiple segmentation schemes that are consistent with (1) the standard dialect segmentations and (2) the phrasal segmentations of an SMT system trained on the resegmented bitext of the local dialect. In a second step, the multiple segmentation schemes are integrated into a single SMT system by characterizing the source language side and merging identical translation pairs of the differently segmented SMT models. Experiments translating three Japanese local dialects (Kumamoto, Kyoto, Osaka) into three Indo-European languages (English, German, Russian) showed that the proposed system outperforms SMT engines trained on character-based and on standard dialect segmentation schemes for the majority of the investigated translation tasks and automatic evaluation metrics.
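
The second step described in the abstract (characterizing the source side and merging identical translation pairs) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "characterizing" means splitting each source phrase into single characters, that a translation model can be represented as a plain dictionary from (source phrase, target phrase) to a score, and that duplicate pairs are resolved by keeping the highest score. The function names `characterize` and `merge_phrase_tables` and the toy data are hypothetical.

```python
def characterize(source_phrase):
    """Split a whitespace-segmented source phrase into single characters.

    Assumption: "characterizing" removes the word boundaries, so that
    differently segmented versions of the same text become identical.
    """
    return tuple(ch for token in source_phrase.split() for ch in token)


def merge_phrase_tables(tables):
    """Merge translation pairs from several differently segmented models.

    Each table maps (source_phrase, target_phrase) -> score. After the
    source side is characterized, pairs that become identical are merged.
    Assumption: the highest score is kept; the paper may combine scores
    in a different way.
    """
    merged = {}
    for table in tables:
        for (source, target), score in table.items():
            key = (characterize(source), target)
            if key not in merged or score > merged[key]:
                merged[key] = score
    return merged


# Toy example: two models that segment the same source string differently.
table_a = {("ab c", "x y"): 0.4, ("d", "z"): 0.9}
table_b = {("a bc", "x y"): 0.6, ("e f", "w"): 0.5}

merged = merge_phrase_tables([table_a, table_b])
for (source_chars, target), score in sorted(merged.items()):
    print(" ".join(source_chars), "|||", target, "|||", score)
```

In this toy run, "ab c" and "a bc" collapse to the same character sequence, so the two entries with target "x y" are merged into one, while the remaining entries are carried over unchanged.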