Improving Korean verb-verb morphological disambiguation using lexical knowledge from unambiguous unlabeled data and selective web counts

  • Authors:
  • Seonho Kim, Juntae Yoon, Jungyun Seo, Seog Park

  • Affiliations:
  • Seonho Kim: Department of Computer Science, Sogang University, Seoul, Republic of Korea
  • Juntae Yoon: Daumsoft Inc., Se-Ah Venture Tower, Seoul, Republic of Korea
  • Jungyun Seo: Department of Computer Science, Sogang University, Seoul, Republic of Korea
  • Seog Park: Department of Computer Science, Sogang University, Seoul, Republic of Korea

  • Venue:
  • Pattern Recognition Letters
  • Year:
  • 2012

Abstract

This paper deals with verb-verb morphological disambiguation: distinguishing between two different verbs that share the same inflected form. Verb-verb morphological ambiguity (VVMA) is one of the critical issues in Korean part-of-speech (POS) tagging. Recognizing the verb base forms of ambiguous words depends heavily on the lexical information in their surrounding contexts and on the domains in which they occur. However, current probabilistic morpheme-based POS tagging systems cannot handle VVMA adequately: most of them are limited in how much word-level context they can reflect, and they are trained on labeled data sets too small to represent the lexical information required for VVMA disambiguation. In this study, we propose a classifier built on a large pool of raw text that contains sufficient lexical information to handle VVMA. The underlying idea is to automatically generate an annotated training set for ambiguity problems such as VVMA resolution from unlabeled, unambiguous instances belonging to the same class. This makes it possible to label ambiguous instances with knowledge induced from unambiguous instances. Since an unambiguous instance admits only one label, its annotated corpus can be generated automatically from unlabeled data. In our problem, because not all conjugations of irregular verbs lead to the spelling changes that cause VVMA, training data for VVMA disambiguation are generated from instances of unambiguous conjugations of each possible verb base form of an ambiguous word. This approach requires neither an additional annotation process for an initial training set nor a process for selecting good seeds with which to iteratively augment the labeled set, both of which are important issues in bootstrapping methods that use unlabeled data. This is a strength over previous related work using unlabeled data.
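The data-generation idea above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `candidates_of` stands in for a morphological analyzer that returns the set of verb base forms that could have produced a surface form, and an instance is kept as labeled training data only when that set is a singleton (i.e., the conjugation is unambiguous).

```python
from collections import defaultdict

def collect_training_data(corpus, candidates_of):
    """Build a labeled training set from unambiguous instances only.

    corpus: iterable of (surface_form, context_words) pairs from raw text.
    candidates_of: maps a surface form to the set of verb base forms
    that could have produced it (morphological analysis, assumed given).
    """
    training = defaultdict(list)
    for surface, context in corpus:
        candidates = candidates_of(surface)
        if len(candidates) == 1:            # unambiguous: the label is certain
            base = next(iter(candidates))
            training[base].append(context)  # context words become features
    return training
```

Instances whose surface form maps to more than one base form are simply skipped; no manual annotation or seed selection is needed, since every retained example carries its label by construction.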
Furthermore, the approach guarantees plenty of confident seeds that are unambiguous and provide sufficient coverage for the learning process. We also suggest a strategy to incrementally extend the context information with web counts, applied only to selected test examples that are difficult to predict with the current classifier or that differ substantially from the pre-trained data set. As a result, automatic data generation and knowledge acquisition from unlabeled text for VVMA resolution improved overall token-level tagging accuracy by 0.04%. In practice, 9-10% of verb-related tagging errors were fixed by VVMA resolution, whose accuracy reached about 98% using a Naive Bayes classifier coupled with selective web counts.
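The classifier-plus-selective-backoff strategy can be sketched as below. This is a minimal multinomial Naive Bayes over context words, assumed for illustration; `web_count` is a hypothetical stand-in for a search-engine hit-count query, consulted only when the classifier's top two classes are separated by a small log-odds margin.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesDisambiguator:
    """Multinomial Naive Bayes with Laplace smoothing (a sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()                 # label -> #instances
        self.word_counts = defaultdict(Counter)       # label -> word counts
        self.vocab = set()

    def fit(self, labeled_contexts):
        for label, words in labeled_contexts:
            self.class_counts[label] += 1
            for w in words:
                self.word_counts[label][w] += 1
                self.vocab.add(w)

    def log_posteriors(self, words):
        total = sum(self.class_counts.values())
        V = len(self.vocab)
        scores = {}
        for c, n in self.class_counts.items():
            s = math.log(n / total)                   # log prior
            denom = sum(self.word_counts[c].values()) + self.alpha * V
            for w in words:                           # smoothed likelihoods
                s += math.log((self.word_counts[c][w] + self.alpha) / denom)
            scores[c] = s
        return scores

    def predict(self, words, web_count=None, margin=1.0):
        scores = self.log_posteriors(words)
        ranked = sorted(scores, key=scores.get, reverse=True)
        best = ranked[0]
        second = ranked[1] if len(ranked) > 1 else None
        # Selective backoff: consult web counts only when the classifier
        # is uncertain (small margin between the top two classes).
        if web_count and second and scores[best] - scores[second] < margin:
            counts = {c: web_count(c, words) for c in (best, second)}
            best = max(counts, key=counts.get)
        return best
```

Querying the web only for low-confidence or out-of-domain test examples keeps the number of expensive count queries small while still extending the context information where the pre-trained model is weakest.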