Manually annotated Hungarian corpus

Authors:
Zoltán Alexin;Tibor Gyimóthy;Csaba Hatvani;László Tihanyi;János Csirik;Károly Bibok;Gábor Prószéky
Affiliations:
University of Szeged;Intelligence at University of Szeged;University of Szeged;MorphoLogic, Budapest;University of Szeged;University of Szeged;MorphoLogic, Budapest
Venue:
EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 2
Year:
2003

Abstract

This paper presents the results of a two-year project in which a consortium of the University of Szeged and MorphoLogic Ltd., Budapest, developed a morpho-syntactically parsed and annotated (disambiguated) corpus for Hungarian. The Hungarian version of MSD (Morpho-Syntactic Description) was used for morpho-syntactic encoding. The corpus contains texts from five topic areas: schoolchildren's compositions, fiction, computer-related texts, news, and legal texts. During annotation, linguists checked the morpho-syntactic parsing of each word. Researchers of the consortium also studied how part-of-speech tagging (disambiguation) rules can be found by machine learning algorithms. Because the corpus reaches 1 million text words, not counting punctuation characters, it may serve as a reference source for numerous future research applications. The corpus is freely available via the Internet for research and educational purposes.
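
To illustrate what an MSD-style morpho-syntactic code looks like in practice, the following is a minimal sketch, not the consortium's tooling. It assumes the positional convention used by MULTEXT-East-style MSD tags: the first character names the part-of-speech category, each following character is one attribute value, and a hyphen marks an attribute that does not apply. The function name `split_msd`, the category table, and the sample tag `Nc-sn` are illustrative assumptions; the exact Hungarian tag inventory used in the corpus may differ.

```python
from typing import List, Optional, Tuple

# Assumed category letters, following the common MULTEXT-East convention.
# The Hungarian MSD version used in the corpus may define more or different codes.
CATEGORY_NAMES = {
    "N": "Noun",
    "V": "Verb",
    "A": "Adjective",
    "P": "Pronoun",
    "R": "Adverb",
    "C": "Conjunction",
    "M": "Numeral",
    "I": "Interjection",
}


def split_msd(tag: str) -> Tuple[str, List[Optional[str]]]:
    """Split a positional MSD tag into a category name and attribute values.

    A hyphen in an attribute position is returned as None, meaning
    "attribute not applicable" under the assumed convention.
    """
    if not tag:
        raise ValueError("empty MSD tag")
    category = CATEGORY_NAMES.get(tag[0], "Unknown")
    attributes = [None if ch == "-" else ch for ch in tag[1:]]
    return category, attributes


if __name__ == "__main__":
    # Hypothetical tag: category letter, then four attribute positions.
    print(split_msd("Nc-sn"))
```

Interpreting what each attribute position means (for example number or case) requires the published attribute tables for the specific MSD version, which this sketch deliberately leaves out.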