MULTEXT-East: morphosyntactic resources for Central and Eastern European languages

Authors: Tomaž Erjavec
Affiliations: Department of Knowledge Technologies, Jožef Stefan Institute, 1000 Ljubljana, Slovenia
Venue: Language Resources and Evaluation
Year: 2012


Abstract

This paper presents the MULTEXT-East language resources, a multilingual dataset for language engineering research focused on the morphosyntactic level of linguistic description. The dataset comprises morphosyntactic specifications, morphosyntactic lexica, and a parallel corpus: George Orwell's novel "1984", sentence-aligned and annotated with hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML according to the Text Encoding Initiative Guidelines (TEI P5) and cover 16 languages, mainly from Central and Eastern Europe: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. The dataset is unique in its language coverage and the richness of its encoding; it is extensively documented and freely available for research purposes. The paper gives an overview of the MULTEXT-East resources by type and language, and closes with conclusions and directions for further work.
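
To make the encoding described above more concrete, the following is a minimal sketch of how a TEI-encoded sentence carrying lemmas and morphosyntactic descriptions (MSDs) might be read in Python. The abstract does not specify the markup, so the element and attribute names used here (`w` for words, `c` for punctuation, `lemma`, `ana`), the sentence identifier, and the MSD strings are assumptions chosen for illustration; the sample sentence is invented and is not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample: markup layout, xml:id, and MSD values are
# illustrative assumptions, not copied from the MULTEXT-East corpus.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0" xml:id="example.1">
  <w lemma="it" ana="#Pp3ns">It</w>
  <w lemma="be" ana="#Vmis3s">was</w>
  <w lemma="a" ana="#Di">a</w>
  <w lemma="day" ana="#Ncns">day</w>
  <c>.</c>
</s>"""


def read_tokens(xml_text):
    """Return (word form, lemma, MSD) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter(f"{{{TEI_NS}}}w"):
        # The ana attribute is assumed to be a "#"-prefixed pointer to an MSD.
        msd = (w.get("ana") or "").lstrip("#")
        tokens.append((w.text, w.get("lemma"), msd))
    return tokens


if __name__ == "__main__":
    for form, lemma, msd in read_tokens(SAMPLE):
        print(form, lemma, msd, sep="\t")
```

Under these assumptions, each word yields its surface form, lemma, and MSD string; punctuation in `c` elements is skipped because only `w` elements are iterated.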