Manually annotated Hungarian corpus

  • Authors:
  • Zoltán Alexin;Tibor Gyimóthy;Csaba Hatvani;László Tihanyi;János Csirik;Károly Bibok;Gábor Prószéky

  • Affiliations:
  • University of Szeged;Intelligence at University of Szeged;University of Szeged;MorphoLogic, Budapest;University of Szeged;University of Szeged;MorphoLogic, Budapest

  • Venue:
  • EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 2
  • Year:
  • 2003

Abstract

This paper presents the results of a two-year project in which a consortium of the University of Szeged and MorphoLogic Ltd., Budapest, developed a morpho-syntactically parsed and annotated (disambiguated) corpus for Hungarian. The Hungarian version of MSD (Morpho-Syntactic Description) was used for morpho-syntactic encoding. The corpus contains texts from five topic areas: schoolchildren's compositions, fiction, computer-related texts, news, and legal texts. During annotation, linguists checked the morpho-syntactic parsing of each word. Researchers of the consortium also studied how to find part-of-speech tagging (disambiguation) rules with machine learning algorithms. Because the corpus reaches up to 1 million text words, not counting punctuation characters, it may serve as a reference source for numerous future research applications. The corpus is freely available via the Internet for research and educational purposes.