A Collection of Comparable Corpora for Under-resourced Languages

Authors:
Inguna Skadiņa;Ahmet Aker;Voula Giouli;Dan Tufis;Robert Gaizauskas;Madara Mieriņa;Nikos Mastropavlos
Affiliations:
Tilde, Latvia;University of Sheffield, UK;Institute for Language and Speech Processing, R.C. “Athena”, Greece;Research Institute for Artificial Intelligence, Romanian Academy Bucharest, Romania;University of Sheffield, UK;Tilde, Latvia;Institute for Language and Speech Processing, R.C. “Athena”, Greece
Venue:
Proceedings of the 2010 conference on Human Language Technologies -- The Baltic Perspective: Proceedings of the Fourth International Conference Baltic HLT 2010
Year:
2010

Abstract

This paper presents work on collecting comparable corpora for nine language pairs: Estonian-English, Latvian-English, Lithuanian-English, Greek-English, Greek-Romanian, Croatian-English, Romanian-English, Romanian-German and Slovenian-English. The objective was to gather texts from the same domains and genres, with a similar level of comparability, as a starting point for defining criteria and metrics of comparability. These criteria and metrics will be applied to comparable texts to determine their suitability for use in Statistical Machine Translation, particularly when translating from or into under-resourced languages for which substantial parallel corpora are unavailable. The collected corpora contain about 1 million words for each under-resourced language.