MULTEXT-East: morphosyntactic resources for Central and Eastern European languages

  • Authors:
  • Tomaž Erjavec

  • Affiliations:
  • Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 1000

  • Venue:
  • Language Resources and Evaluation
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

The paper presents the MULTEXT-East language resources, a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description. The MULTEXT-East dataset includes the morphosyntactic specifications, morphosyntactic lexica, and a parallel corpus, the novel "1984" by George Orwell, which is sentence aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML, using the Text Encoding Initiative Guidelines, TEI P5, and cover 16 languages, mainly from Central and Eastern Europe: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. This dataset, unique in terms of languages covered and the wealth of encoding, is extensively documented, and freely available for research purposes. The paper overviews the MULTEXT-East resources by type and language and gives some conclusions and directions for further work.