Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word English-Swahili parallel corpus. We describe the data collection phase and focus on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence- and word-aligned, and morphosyntactic information was added to both the English and Swahili portions of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English-Swahili translation dictionaries. We particularly focus on the difficulties of translating English into Swahili, a morphologically more complex Bantu language.
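The idea of projecting part-of-speech annotation across word alignments can be illustrated with a minimal sketch. It is not the procedure used in the paper, which the abstract does not specify. The function name project_pos_tags, the tag labels, the majority-vote rule, and the hand-written alignment links are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def project_pos_tags(src_tags, tgt_len, alignment, default="UNK"):
    """Project source-side POS tags onto target tokens (hypothetical helper).

    src_tags:  POS tags of the source (English) tokens
    tgt_len:   number of target (Swahili) tokens
    alignment: iterable of (source_index, target_index) word alignment links
    default:   tag given to target tokens that have no alignment link
    """
    votes = defaultdict(Counter)
    for s, t in alignment:
        # Ignore links that point outside either sentence.
        if 0 <= s < len(src_tags) and 0 <= t < tgt_len:
            votes[t][src_tags[s]] += 1

    projected = []
    for t in range(tgt_len):
        if t in votes:
            # Majority vote; ties go to the tag that was seen first.
            projected.append(votes[t].most_common(1)[0][0])
        else:
            projected.append(default)
    return projected


# English: "the child reads a book"   Swahili: "mtoto anasoma kitabu"
english_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
# Assumed links: child-mtoto, reads-anasoma, book-kitabu.
# The English articles are left unaligned, since Swahili has no articles.
links = [(1, 0), (2, 1), (4, 2)]

print(project_pos_tags(english_tags, 3, links))
# ['NOUN', 'VERB', 'NOUN']
```

A one-to-one vote like this also hints at why the morphological asymmetry matters: a single Swahili verb form often corresponds to several English words, so several conflicting tags can compete for one target token.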