The EU Copernicus project Multext-East has created a multilingual corpus of text and speech data covering the six languages of the project: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In addition, wordform lexicons were developed for each of the languages. The corpus includes a parallel component consisting of Orwell's Nineteen Eighty-Four, with versions in all six languages tagged for part of speech (POS) and aligned to the English original, which is also POS-tagged. We describe the encoding format and data architecture designed especially for this corpus, which are also more generally usable for encoding linguistic corpora. We also describe the methodology for developing a harmonized set of morphosyntactic descriptions (MSDs), which builds upon the scheme for Western European languages developed within the EAGLES project. We discuss the special concerns in handling the six project languages, which cover three distinct language families.