The SAWA corpus: a parallel corpus English - Swahili

Authors:
Guy De Pauw;Peter Waiganjo Wagacha;Gilles-Maurice de Schryver
Affiliations:
University of Antwerp, Belgium and University of Nairobi, Kenya;University of Nairobi, Kenya;Ghent University, Belgium and University of the Western Cape, South Africa
Venue:
AfLaT '09 Proceedings of the First Workshop on Language Technologies for African Languages
Year:
2009

Abstract

Research in data-driven methods for Machine Translation has greatly benefited from the increasing availability of parallel corpora. Processing the same text in two different languages yields useful information on how words and phrases are translated from a source language into a target language. To investigate this, a parallel corpus is typically aligned by linking linguistic tokens in the source language to the corresponding units in the target language. An aligned parallel corpus therefore facilitates the automatic development of a machine translation system and can also bootstrap annotation through projection. In this paper, we describe data collection and annotation efforts, together with preliminary experimental results, for an English-Swahili parallel corpus.
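
As an illustration of what "bootstrapping annotation through projection" can mean in practice, the following is a minimal sketch, not the authors' implementation. It assumes a word alignment is already available as a list of (source index, target index) pairs and copies part-of-speech tags from English tokens to the Swahili tokens they are linked to. The function name `project_tags`, the tag labels, the `UNK` placeholder, and the toy sentence pair are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def project_tags(
    source_tags: List[str],
    alignment: List[Tuple[int, int]],
    target_len: int,
    unknown: str = "UNK",
) -> List[str]:
    """Copy each source tag onto the target token it is aligned to.

    Target tokens with no alignment link keep the `unknown` placeholder.
    If several source tokens link to one target token, the last link wins
    (a simplification made for this sketch).
    """
    projected = [unknown] * target_len
    for src_idx, tgt_idx in alignment:
        projected[tgt_idx] = source_tags[src_idx]
    return projected


# Toy, hand-made example (not taken from the SAWA corpus).
english = ["the", "child", "reads", "a", "book"]
english_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
swahili = ["mtoto", "anasoma", "kitabu"]

# Hypothetical word alignment: child-mtoto, reads-anasoma, book-kitabu.
# The English determiners have no Swahili counterpart and stay unaligned.
alignment = [(1, 0), (2, 1), (4, 2)]

swahili_tags = project_tags(english_tags, alignment, len(swahili))
for token, tag in zip(swahili, swahili_tags):
    print(f"{token}\t{tag}")
```

In a realistic setting the alignment would come from an automatic word aligner and would contain errors, so projected tags are usually treated as noisy training data and not as gold annotation.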