Multilingual mobile-phone translation services for world travelers
COLING '08 22nd International Conference on on Computational Linguistics: Demonstration Papers
Unsupervised search for the optimal segmentation for statistical machine translation
ACLstudent '10 Proceedings of the ACL 2010 Student Research Workshop
WMT '10 Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
Word segmentation for dialect translation
CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part II
Bayesian word alignment for statistical machine translation
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Dialect translation: integrating Bayesian co-segmentation models with pivot-based SMT
DIALECTS '11 Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties
Distributed speech translation technologies for multiparty multilingual communication
ACM Transactions on Speech and Language Processing (TSLP)
SSST-6 '12 Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation
A-STAR: Toward translating Asian spoken languages
Computer Speech and Language
How to Choose the Best Pivot Language for Automatic Translation of Low-Resource Languages
ACM Transactions on Asian Language Information Processing (TALIP)
Hi-index | 0.00 |
This paper investigates issues in preparing corpora for developing speech-to-speech translation (S2ST). It is impractical to create a broad-coverage parallel corpus only from dialog speech. An alternative approach is to have bilingual experts write conversational-style texts in the target domain, with translations. There is, however, a risk of losing fidelity to the actual utterances. This paper focuses on balancing a tradeoff between these two kinds of corpora through the analysis of two newly developed corpora in the travel domain: a bilingual parallel corpus with 420 K utterances and a collection of in-domain dialogs using actual S2ST systems. We found that the first corpus is effective for covering utterances in the second corpus if complimented with a small number of utterances taken from monolingual dialogs. We also found that characteristics of in-domain utterances become closer to those of the first corpus when more restrictive conditions and instructions to speakers are given. These results suggest the possibility of a bootstrap-style of development of corpora and S2ST systems, where an initial S2ST system is developed with parallel texts, and is then gradually improved with in-domain utterances collected by the system as restrictions are relaxed