The paper presents the MULTEXT-East language resources, a multilingual dataset for language engineering research that focuses on the morphosyntactic level of linguistic description. The dataset comprises morphosyntactic specifications, morphosyntactic lexica, and a parallel corpus: George Orwell's novel "1984", which is sentence-aligned and annotated with hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML according to the Text Encoding Initiative Guidelines (TEI P5) and cover 16 languages, mainly from Central and Eastern Europe: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. The dataset, unique in its language coverage and the richness of its encoding, is extensively documented and freely available for research purposes. The paper gives an overview of the MULTEXT-East resources by type and language, and offers some conclusions and directions for further work.