Data-driven part-of-speech tagging of Kiswahili

Authors:
Guy De Pauw;Gilles-Maurice de Schryver;Peter W. Wagacha
Affiliations:
CNTS – Language Technology Group, University of Antwerp, Belgium;African Languages and Cultures, Ghent University, Belgium;School of Computing and Informatics, University of Nairobi, Kenya
Venue:
TSD'06 Proceedings of the 9th international conference on Text, Speech and Dialogue
Year:
2006

Citing 6

Making large-scale support vector machine learning practical
Advances in kernel methods

Improving accuracy in word class tagging through the combination of machine learning systems
Computational Linguistics

TnT: a statistical part-of-speech tagger
ANLC '00 Proceedings of the sixth conference on Applied natural language processing

A simple rule-based part of speech tagger
ANLC '92 Proceedings of the third conference on Applied natural language processing

Disambiguation of morphological analysis in Bantu languages
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1

The role of algorithm bias vs information source in learning algorithms for Morphosyntactic Disambiguation
ConLL '00 Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning - Volume 7

Cited 6

The SAWA corpus: a parallel corpus English - Swahili
AfLaT '09 Proceedings of the First Workshop on Language Technologies for African Languages

Methods for Amharic part-of-speech tagging
AfLaT '09 Proceedings of the First Workshop on Language Technologies for African Languages

Expanding a multilingual media monitoring and information extraction tool to a new language: Swahili
Language Resources and Evaluation

Exploring the SAWA corpus: collection and deployment of a parallel corpus English--Swahili
Language Resources and Evaluation

Introduction to the special issue on African Language Technology
Language Resources and Evaluation

Hybrid combination of constituency and dependency trees into an ensemble dependency parser
HYBRID '12 Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data

Abstract

In this paper we present experiments with data-driven part-of-speech taggers trained and evaluated on the annotated Helsinki Corpus of Swahili. Using four of the current state-of-the-art data-driven taggers, TnT, MBT, SVMTool and MXPOST, we find MXPOST to be the most accurate tagger for the Kiswahili dataset. We further improve on the performance of the individual taggers by combining them into a committee of taggers. We observe that the more naive combination methods, like the novel plural voting approach, outperform more elaborate schemes like cascaded classifiers and weighted voting. This paper is the first publication to present experiments on data-driven part-of-speech tagging for Kiswahili and Bantu languages in general.
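
The idea of a committee of taggers can be illustrated with a minimal sketch. The abstract does not specify how plural voting is computed, so the code below shows plain majority voting instead, which is a different and simpler scheme. The function name `majority_vote`, the `tie_breaker` parameter, and the tags in the example are all hypothetical and are not taken from the paper or from any of the four taggers.

```python
from collections import Counter


def majority_vote(tagger_outputs, tie_breaker=0):
    """Combine tag sequences from several taggers by simple majority voting.

    tagger_outputs: list of tag sequences, one per tagger, all of equal length.
    tie_breaker: index of the tagger preferred when a vote is tied
                 (an assumption made for this sketch).
    """
    if not tagger_outputs:
        return []
    length = len(tagger_outputs[0])
    if any(len(seq) != length for seq in tagger_outputs):
        raise ValueError("all taggers must tag the same number of tokens")

    combined = []
    # zip(*...) yields, for each token, the tags proposed by every tagger.
    for token_tags in zip(*tagger_outputs):
        counts = Counter(token_tags)
        best = max(counts.values())
        winners = {tag for tag, n in counts.items() if n == best}
        preferred = token_tags[tie_breaker]
        if preferred in winners:
            # The preferred tagger's tag is among the top-voted tags.
            combined.append(preferred)
        else:
            # Otherwise take the first top-voted tag in tagger order.
            combined.append(next(t for t in token_tags if t in winners))
    return combined


# Invented outputs of four taggers for a three-token sentence.
outputs = [
    ["N", "V", "ADJ"],
    ["N", "V", "N"],
    ["N", "N", "N"],
    ["PRON", "V", "ADJ"],
]
print(majority_vote(outputs))
```

In this sketch every tagger has exactly one vote per token. A weighted scheme would replace the counts with sums of per-tagger weights, for example accuracies measured on held-out data.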