The MULTEXT-East morphosyntactic specifications for Slavic languages

Authors:
Tomaž Erjavec;Cvetana Krstev;Vladimír Petkevič;Kiril Simov;Marko Tadić;Duško Vitas
Affiliations:
Jožef Stefan Institute, Ljubljana;University of Belgrade;Charles University, Prague;Bulgarian Academy of Sciences;Zagreb University;University of Belgrade
Venue:
MorphSlav '03 Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages
Year:
2003

Citing (4):
Morphological tagging: data vs. dictionaries. NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Multext-East: parallel and comparable corpora and lexicons for six Central and Eastern European languages. COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
MULTEXT: Multilingual Text Tools and Corpora. COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 1
Building the Croatian morphological lexicon. MorphSlav '03 Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages

Cited by (3):
TermeX: A Tool for Collocation Extraction. CICLing '09 Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing
OWL/DL formalization of the MULTEXT-East morphosyntactic specifications. LAW V '11 Proceedings of the 5th Linguistic Annotation Workshop
Persian in MULTEXT-East framework. FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing

Abstract

Word-level morphosyntactic descriptions, such as "Ncmsn" designating a common masculine singular noun in the nominative, have been developed for all Slavic languages, yet there have been few attempts to arrive at a proposal that would be harmonised across the languages. Standardisation adds to the interchange potential of the resources, making it easier to develop multilingual applications or to evaluate language technology tools across several languages. The process of harmonising morphosyntactic categories, especially for the morphologically rich Slavic languages, is also interesting from a language-typological perspective. The EU MULTEXT-East project developed corpora, lexica and tools for seven languages, with the focus on morphosyntactic data, including formal, EAGLES-based specifications for lexical morphosyntactic descriptions. The specifications were later extended, so that they currently cover nine languages, five of them from the Slavic family: Bulgarian, Croatian, Czech, Serbian and Slovene. The paper presents these morphosyntactic specifications, giving their background and structure, including the encoding of the tables as TEI feature structures. The five Slavic language specifications are discussed in more depth.
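
To illustrate the positional encoding behind a description such as "Ncmsn", here is a minimal Python sketch. It assumes that the first character names the category and that each following character gives the value of one attribute in a fixed order. The function name decode_msd and the lookup tables are hypothetical and cover only the values named in the abstract; they are not the actual specification tables, which differ per language. The hyphen as a marker for a non-applicable attribute is likewise treated as an assumption here.

```python
# Illustrative subset only: just the codes needed for the abstract's
# example "Ncmsn". The real specifications define many more categories,
# attributes and values, and they vary by language.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common"}),
    ("Gender", {"m": "masculine"}),
    ("Number", {"s": "singular"}),
    ("Case", {"n": "nominative"}),
]

# Hypothetical category table: maps the first character to a category
# name and its ordered attribute list.
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional morphosyntactic description into named features."""
    if not msd:
        raise ValueError("empty MSD")
    if msd[0] not in CATEGORIES:
        raise ValueError(f"unknown category code: {msd[0]!r}")
    category, attributes = CATEGORIES[msd[0]]
    features = {"Category": category}
    # Position i after the category code holds the value of attribute i.
    for code, (name, values) in zip(msd[1:], attributes):
        if code == "-":
            # Assumed convention: a hyphen marks a non-applicable attribute.
            continue
        features[name] = values.get(code, f"unknown ({code})")
    return features


if __name__ == "__main__":
    print(decode_msd("Ncmsn"))
```

Because the position of a character determines the attribute it refers to, the same letter can stand for different values in different positions, which keeps the descriptions compact.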