Multext-East: parallel and comparable corpora and lexicons for six Central and Eastern European languages

Authors:
Ludmila Dimitrova;Nancy Ide;Vladimir Petkevic;Tomaz Erjavec;Heiki Jaan Kaalep;Dan Tufis
Affiliations:
Institute of Mathematics and Informatics, Sofia, Bulgaria;Vassar College, Poughkeepsie, New York;Charles University, Prague, Czech Republic;Institute Jozef Stefan, Ljubljana, Slovenia;University of Tartu, Tartu, Estonia;Romanian Academy, Center for Artificial Intelligence, Bucharest, Romania
Venue:
COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
Year:
1998

Abstract

The EU Copernicus project Multext-East has created a multi-lingual corpus of text and speech data, covering the six languages of the project: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In addition, wordform lexicons for each of the languages were developed. The corpus includes a parallel component consisting of Orwell's Nineteen Eighty-Four, with versions in all six languages tagged for part-of-speech and aligned to English (also tagged for POS). We describe the encoding format and data architecture designed especially for this corpus, which is generally usable for encoding linguistic corpora. We also describe the methodology for the development of a harmonized set of morphosyntactic descriptions (MSDs), which builds upon the scheme for western European languages developed within the EAGLES project. We discuss the special concerns for handling the six project languages, which cover three distinct language families.
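
As a rough illustration of what a morphosyntactic description looks like in practice: MSDs in this tradition are compact positional strings, where the first character names the part of speech and each later position holds a one-letter value for a fixed attribute, with a hyphen marking an attribute that does not apply. The sketch below decodes such a string. The attribute table is a small illustrative subset assumed for this example (nouns only, four attributes), and the function name `decode_msd` is hypothetical; neither is taken from the abstract, and the actual specifications define many more categories and language-specific values.

```python
# Minimal sketch of decoding a positional MSD string such as "Ncmsn".
# ASSUMPTION: the table below is a simplified, illustrative subset
# (nouns only); it is not the full project specification.

NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive",
              "d": "dative", "a": "accusative"}),
]

# Maps the first character of an MSD to a category name and its
# ordered attribute list.
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional MSD string into an attribute/value dict."""
    if not msd or msd[0] not in CATEGORIES:
        raise ValueError(f"unsupported MSD: {msd!r}")
    category, attributes = CATEGORIES[msd[0]]
    features = {"Category": category}
    # Pair each remaining character with the attribute at that position.
    for code, (attribute, values) in zip(msd[1:], attributes):
        if code == "-":
            continue  # hyphen: attribute not applicable or unspecified
        # Unknown codes are kept as-is rather than guessed at.
        features[attribute] = values.get(code, code)
    return features


print(decode_msd("Ncmsn"))
```

Because every language shares the same positions for shared attributes, one decoder of this shape can serve all languages in a harmonized scheme, with language-specific values added to the tables as needed.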