Exploring the SAWA corpus: collection and deployment of a parallel corpus English-Swahili

Authors:
Guy De Pauw; Peter Waiganjo Wagacha; Gilles-Maurice de Schryver
Affiliations:
CLiPS, Department of Linguistics, University of Antwerp, Antwerp, Belgium and School of Computing and Informatics, University of Nairobi, Nairobi, Kenya;School of Computing and Informatics, University of Nairobi, Nairobi, Kenya;Department of African Languages and Cultures, Ghent University, Ghent, Belgium and Xhosa Department, University of the Western Cape, Cape Town, South Africa
Venue:
Language Resources and Evaluation
Year:
2011

Abstract

Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word English-Swahili parallel corpus. We describe the data collection phase and focus on the difficulty of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence- and word-aligned, and morphosyntactic information was added to both the English and Swahili portions of the corpus. We then investigate two applications of the annotated parallel corpus. The first is the projection of part-of-speech tagging annotation from English onto Swahili. The second is the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English-Swahili translation dictionaries. We particularly focus on the difficulties of translating English into Swahili, a morphologically more complex Bantu language.
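To make the idea of annotation projection concrete, the following is a minimal sketch of how part-of-speech tags might be copied from English tokens onto Swahili tokens through a word alignment. It is an illustration only, not the procedure used in the paper: the function name `project_pos_tags`, the tag labels, the majority-vote rule for many-to-one links, and the hand-made alignment of the toy sentence pair are all assumptions.

```python
from collections import Counter


def project_pos_tags(source_tags, alignment, target_length, unaligned_tag="UNK"):
    """Project source-side POS tags onto target tokens via word alignment.

    source_tags:   list of tags, one per source (English) token.
    alignment:     iterable of (source_index, target_index) pairs.
    target_length: number of target (Swahili) tokens.

    Assumption for this sketch: when several source tokens link to one
    target token, the most frequent tag wins; ties go to the tag of the
    lowest-indexed source token. Unaligned target tokens get `unaligned_tag`.
    """
    candidates = [[] for _ in range(target_length)]
    # Sorting makes the tie-breaking rule deterministic.
    for src_idx, tgt_idx in sorted(alignment):
        candidates[tgt_idx].append(source_tags[src_idx])

    projected = []
    for tags in candidates:
        if tags:
            projected.append(Counter(tags).most_common(1)[0][0])
        else:
            projected.append(unaligned_tag)
    return projected


# Toy example (hand-made, hypothetical alignment):
# "the child reads a book" <-> "mtoto anasoma kitabu"
english_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
links = [(1, 0), (2, 1), (4, 2)]  # child-mtoto, reads-anasoma, book-kitabu
print(project_pos_tags(english_tags, links, target_length=3))
# ['NOUN', 'VERB', 'NOUN']
```

The sketch also hints at why the morphological gap matters. A single Swahili verb form can correspond to several English words (subject pronoun, verb, object pronoun), so a naive vote over the linked English tags can pick the wrong tag for the Swahili token.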