The effect of borderline examples on language learning

  • Authors: Katia Lida Kermanidis

  • Affiliations: Department of Informatics, Ionian University, Corfu, Greece

  • Venue: Journal of Experimental & Theoretical Artificial Intelligence

  • Year: 2009

Abstract

Imbalanced training sets, where one or more classes are heavily underrepresented in the training data compared to the others, prove problematic when classifying new instances that belong to a rare class. In the present article, class imbalance occurs in the datasets of three language learning applications: automatic identification of verb complements, automatic recognition of semantic entities, and learning taxonomic relations. All three applications are tested on Modern Greek text corpora; verb complement identification is also applied to English corpora, and comparative results are presented. The difference in statistical behaviour between the classes may be attributed to the low pre-processing level of the corpora, to the automatic nature of the pre-processing phase, and/or to the exceptionally rare occurrence of the linguistic information of interest in the data. Two different approaches are explored to deal with the problem: one-sided sampling (OSS) of the dataset and classification using support vector machines (SVMs). OSS removes redundant, noisy, and misleading instances of the majority class, thereby reducing the training set size, whereas SVMs, without removing any instances, take into account only examples that lie close to the borderline region between the classes. The better results achieved by OSS in all test cases lead to some interesting observations regarding the impact of individual training instances on classification, depending on their position in the feature-vector space.
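To make the contrast between the two approaches concrete, the sketch below applies one-sided sampling before training an ordinary learner, and separately trains an SVM on the full, unreduced data, using the scikit-learn and imbalanced-learn libraries on a synthetic imbalanced dataset. This is not the paper's experimental code: the synthetic data, the decision-tree base learner paired with OSS, and all parameter settings are illustrative assumptions.

```python
# A minimal sketch of the two approaches contrasted in the abstract.
# The dataset, the base learner used after OSS, and all parameters
# are illustrative assumptions, not the paper's original setup.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from imblearn.under_sampling import OneSidedSelection

# Synthetic stand-in for the corpus-derived feature vectors, with a
# heavily underrepresented positive class (roughly 5% of instances).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          random_state=0)

# Approach 1: one-sided sampling discards redundant, noisy and
# misleading majority-class instances, shrinking the training set
# before an ordinary learner (here a decision tree) is trained.
oss = OneSidedSelection(random_state=0)
X_res, y_res = oss.fit_resample(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=0).fit(X_res, y_res)

# Approach 2: an SVM trained on the full set; its decision boundary
# is determined only by the support vectors, i.e. the examples
# lying close to the borderline region between the classes.
svm = SVC(kernel="rbf").fit(X_tr, y_tr)

# Per-class precision/recall makes the effect on the rare class
# visible, unlike plain accuracy on an imbalanced test set.
for name, clf in [("OSS + decision tree", tree), ("SVM", svm)]:
    print(name)
    print(classification_report(y_te, clf.predict(X_te), digits=3))
```

Reporting per-class precision and recall rather than overall accuracy is the standard way to evaluate imbalanced-class experiments such as these, since a classifier that ignores the rare class can still score a high accuracy.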