This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous source language text. The goal is to improve the quality of statistical machine translation (SMT) for local dialects by exploiting linguistic information from the standard language. The method iteratively learns multiple segmentation schemes that are consistent with (1) the standard dialect segmentations and (2) the phrasal segmentations of an SMT system trained on the resegmented bitext of the local dialect. In a second step, the multiple segmentation schemes are integrated into a single SMT system by characterizing the source language side and merging identical translation pairs of differently segmented SMT models. Experiments translating three Japanese local dialects (Kumamoto, Kyoto, Osaka) into three Indo-European languages (English, German, Russian) show that the proposed system outperforms SMT engines trained on character-based as well as standard dialect segmentation schemes for the majority of the investigated translation tasks and automatic evaluation metrics.
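The integration step can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: "characterizing" is read here as re-splitting the source side into single characters, a phrase table is modeled as a plain dictionary from (source, target) pairs to a probability, and identical pairs are merged by keeping the higher probability. The names `characterize` and `merge_phrase_tables` are hypothetical and are not taken from the paper.

```python
def characterize(source_phrase: str) -> str:
    """Re-split a segmented source phrase into single characters.

    Assumption: this is one possible reading of "characterizing the
    source language side"; the paper may define it differently.
    """
    return " ".join(source_phrase.replace(" ", ""))


def merge_phrase_tables(tables):
    """Merge phrase tables built under different segmentation schemes.

    Each table maps (source_phrase, target_phrase) -> probability.
    Pairs that become identical once the source side is characterized
    are collapsed into one entry. Keeping the maximum probability is an
    illustrative choice, not the paper's stated merging rule.
    """
    merged = {}
    for table in tables:
        for (source, target), prob in table.items():
            key = (characterize(source), target)
            merged[key] = max(prob, merged.get(key, 0.0))
    return merged
```

For example, the entries `("おお きに", "thank you")` from one segmentation scheme and `("おおきに", "thank you")` from another map to the same key under this sketch, so they collapse into a single translation pair.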