A Morphological Tagger for Korean: Statistical Tagging Combined with Corpus-Based Morphological Rule Application

Authors: Chung-Hye Han; Martha Palmer
Affiliations: Department of Linguistics, Simon Fraser University, Burnaby, Canada V5A1SC; Department of Computer and Information Sciences, University of Pennsylvania, Philadelphia, USA 19104-6389
Venue: Machine Translation
Year: 2004

Abstract

This paper describes a novel approach to morphological tagging for Korean, an agglutinative language with a very productive inflectional system. The tagger takes raw text as input and returns lemmatized, morphologically disambiguated output for each word: the lemma is labeled with a part-of-speech (POS) tag, and the inflections are labeled with inflectional tags. Unlike the standard approach to tagging morphologically complex languages, the proposed approach places the tagging phase before the analysis phase. The tagger comprises a trigram-based tagging component followed by a morphological rule application component, and it obtains 95% precision and recall on unseen test data.
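
The tag-then-analyze ordering described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the probabilities, the romanized example words, the tag labels, the suffix rules, and the function names (`trigram_tag`, `apply_rules`, `analyze`) are all assumptions made up for illustration. A trigram model first assigns each word a complex tag, and tag-specific rules then split the word into a lemma and its inflections.

```python
import math

# Toy model (hypothetical numbers): P(tag | two previous tags) and P(word | tag).
TRANS = {
    ("<s>", "<s>", "NNC+PAD"): 0.6,
    ("<s>", "<s>", "VV+EPF+EFN"): 0.1,
    ("<s>", "NNC+PAD", "VV+EPF+EFN"): 0.7,
    ("<s>", "NNC+PAD", "NNC+PAD"): 0.1,
}
EMIT = {
    ("hakkyoey", "NNC+PAD"): 0.5,
    ("kassta", "VV+EPF+EFN"): 0.4,
    ("kassta", "NNC+PAD"): 0.01,
}
TAGS = ["NNC+PAD", "VV+EPF+EFN"]
FLOOR = 1e-6  # crude stand-in for smoothing of unseen events


def trigram_tag(words):
    """Viterbi search over tag pairs for the best trigram-model tag sequence."""
    # best[(previous tag, current tag)] = (log-probability, tags chosen so far)
    best = {("<s>", "<s>"): (0.0, [])}
    for w in words:
        nxt = {}
        for (t2, t1), (score, seq) in best.items():
            for t in TAGS:
                s = (score
                     + math.log(TRANS.get((t2, t1, t), FLOOR))
                     + math.log(EMIT.get((w, t), FLOOR)))
                key = (t1, t)
                if key not in nxt or s > nxt[key][0]:
                    nxt[key] = (s, seq + [t])
        best = nxt
    return max(best.values(), key=lambda entry: entry[0])[1]


# Hypothetical rules keyed by tag: (surface suffix, segmentation of that suffix).
RULES = {
    "NNC+PAD": [("ey", ["ey/PAD"])],
    "VV+EPF+EFN": [("ssta", ["ss/EPF", "ta/EFN"])],
}


def apply_rules(word, tag):
    """Split a tagged word into a POS-labeled lemma plus labeled inflections."""
    lemma_tag = tag.split("+")[0]
    for suffix, morphs in RULES.get(tag, []):
        if word.endswith(suffix) and len(word) > len(suffix):
            return [f"{word[:-len(suffix)]}/{lemma_tag}"] + morphs
    return [f"{word}/{tag}"]  # no rule matched: leave the word unanalyzed


def analyze(sentence):
    """Tag first, then apply morphological rules to each tagged word."""
    words = sentence.split()
    return [apply_rules(w, t) for w, t in zip(words, trigram_tag(words))]


print(analyze("hakkyoey kassta"))
```

Because the tag is chosen before any segmentation happens, the rule step only has to consider the rules licensed by that one tag, rather than enumerating every possible analysis of the word and disambiguating afterwards.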