Learning verb complements for Modern Greek: Balancing the noisy dataset

  • Authors:
  • Katia Kermanidis, Manolis Maragoudakis, Nikos Fakotakis, George Kokkinakis

  • Affiliations:
  • Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rio 26500, Greece (all authors). Email: kerman@wcl.ee.upatras.gr, mmarag@wcl.ee.upatras.gr, fakotaki@wcl.e ...

  • Venue:
  • Natural Language Engineering
  • Year:
  • 2008

Abstract

Attempting to learn automatically to identify verb complements from natural language corpora, without the help of sophisticated linguistic resources such as grammars, parsers or treebanks, leads to a significant amount of noise in the data. In machine learning terms, where learning from examples is performed using class-labelled feature-value vectors, this noise leads to an imbalanced set of vectors: assuming the class label takes two values (in this work, complement/non-complement), one class (complements) is heavily underrepresented in the data compared to the other. To overcome the drop in accuracy on the rare class caused by this disproportion, we balance the learning data by applying one-sided sampling to the training corpus, thereby reducing the number of non-complement instances. This approach has been used in the past in several domains (image processing, medicine, etc.), but not in natural language processing. For identifying the examples that are safe to remove, we use the value difference metric, which proves more suitable for nominal attributes, such as those this work deals with, than the Euclidean distance traditionally used in one-sided sampling. We experiment with learning algorithms that are widely used and whose performance is well known to the machine learning community: Bayesian learners, instance-based learners and decision trees. Additionally, we present and test a variation of Bayesian belief networks, the COr-BBN (Class-oriented Bayesian belief network). Performance improves by up to 22% after balancing the dataset, reaching an f-measure of 73.7% for the complement class, using only a phrase chunker and basic morphological information for preprocessing.
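Illustrative sketch (not taken from the paper): the Python fragment below shows one way to combine the value difference metric with a condensed-nearest-neighbour pass of one-sided sampling. The VDM between two values v1 and v2 of a nominal attribute is the sum over classes c of |P(c|v1) - P(c|v2)|^q. All identifiers and the dataset layout are hypothetical, and the Tomek-link removal stage of full one-sided selection is omitted for brevity.

# A minimal sketch of one-sided sampling with the Value Difference Metric (VDM).
# Assumes X is a list of nominal feature vectors and y a parallel list of class
# labels (e.g. "complement" / "non-complement"); all names are placeholders.
from collections import defaultdict
import random

def vdm_tables(X, y):
    """Estimate P(class | attribute = value) for every nominal attribute."""
    n_attrs = len(X[0])
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(n_attrs)]
    totals = [defaultdict(int) for _ in range(n_attrs)]
    for xi, yi in zip(X, y):
        for a, v in enumerate(xi):
            counts[a][v][yi] += 1
            totals[a][v] += 1
    return counts, totals, sorted(set(y))

def vdm_distance(x1, x2, counts, totals, classes, q=2):
    """Sum over attributes and classes of |P(c|v1) - P(c|v2)|^q."""
    d = 0.0
    for a, (v1, v2) in enumerate(zip(x1, x2)):
        for c in classes:
            p1 = counts[a][v1][c] / totals[a][v1] if totals[a][v1] else 0.0
            p2 = counts[a][v2][c] / totals[a][v2] if totals[a][v2] else 0.0
            d += abs(p1 - p2) ** q
    return d

def one_sided_sample(X, y, minority_label, seed=0):
    """Keep all minority instances; condense the majority class with 1-NN."""
    random.seed(seed)
    counts, totals, classes = vdm_tables(X, y)
    minority = [(xi, yi) for xi, yi in zip(X, y) if yi == minority_label]
    majority = [(xi, yi) for xi, yi in zip(X, y) if yi != minority_label]
    store = minority + [random.choice(majority)]
    for xi, yi in majority:
        # Classify each majority instance with 1-NN against the current store
        # (VDM distance); misclassified instances are kept, the rest dropped.
        nearest = min(store, key=lambda s: vdm_distance(xi, s[0], counts,
                                                        totals, classes))
        if nearest[1] != yi:
            store.append((xi, yi))
    return store

Under these assumptions, one would run one_sided_sample over the complement/non-complement vectors and train any of the reported classifiers on the condensed set; because the VDM is defined from class-conditional value frequencies, it needs no numeric encoding of the nominal attributes, which is what makes it a natural substitute for the Euclidean distance here.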