Morphological tagging: data vs. dictionaries

Authors:
Jan Hajič
Affiliations:
Johns Hopkins University, Baltimore, MD
Venue:
NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Year:
2000

Abstract

Part-of-speech tagging for English seems to have reached human levels of error, but full morphological tagging for inflectionally rich languages, such as Romanian, Czech, or Hungarian, is still an open problem, and the results are far from satisfactory. This paper presents results obtained by using a universalized exponential feature-based model for five such languages. It focuses on the data sparseness issue, which is especially severe for such languages (all the more so because no extensive annotated data exist for them). In conclusion, we argue strongly that, under such circumstances, using an independent morphological dictionary is preferable to adding more annotated data.