Learning verb complements for Modern Greek: Balancing the noisy dataset

  • Authors:
  • Katia Kermanidis, Manolis Maragoudakis, Nikos Fakotakis, George Kokkinakis

  • Affiliations:
  • Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rio 26500, Greece (all authors). Email: kerman@wcl.ee.upatras.gr, mmarag@wcl.ee.upatras.gr, fakotaki@wcl.e ...

  • Venue:
  • Natural Language Engineering
  • Year:
  • 2008

Abstract

Attempting to learn automatically to identify verb complements from natural language corpora, without the help of sophisticated linguistic resources such as grammars, parsers or treebanks, leads to a significant amount of noise in the data. In machine learning terms, where learning from examples is performed using class-labelled feature-value vectors, this noise leads to an imbalanced set of vectors: assuming the class label takes two values (in this work, complement/non-complement), one class (complements) is heavily underrepresented in the data compared to the other. To overcome the drop in accuracy on the rare class caused by this disproportion, we balance the learning data by applying one-sided sampling to the training corpus, thereby reducing the number of non-complement instances. This approach has been used in the past in several domains (image processing, medicine, etc.), but not in natural language processing. For identifying the examples that are safe to remove, we use the value difference metric, which proves more suitable for nominal attributes, such as those this work deals with, than the Euclidean distance traditionally used in one-sided sampling. We experiment with learning algorithms that are widely used and whose performance is well known to the machine learning community: Bayesian learners, instance-based learners and decision trees. Additionally, we present and test a variation of Bayesian belief networks, the COr-BBN (Class-oriented Bayesian belief network). Performance improves by up to 22% after balancing the dataset, reaching an f-measure of 73.7% for the complement class, using only a phrase chunker and basic morphological information for preprocessing.
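Illustrative sketch (not taken from the paper): the Python fragment below shows one way to combine the value difference metric with a condensed-nearest-neighbour pass of one-sided sampling. The VDM between two values v1 and v2 of a nominal attribute is the sum over classes c of |P(c|v1) - P(c|v2)|^q. All identifiers and the dataset layout are hypothetical, and the Tomek-link removal stage of full one-sided selection is omitted for brevity.

# A minimal sketch of one-sided sampling with the Value Difference Metric (VDM).
# Assumes X is a list of nominal feature vectors and y a parallel list of class
# labels (e.g. "complement" / "non-complement"); all names are placeholders.
from collections import defaultdict
import random

def vdm_tables(X, y):
    """Estimate P(class | attribute = value) for every nominal attribute."""
    n_attrs = len(X[0])
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(n_attrs)]
    totals = [defaultdict(int) for _ in range(n_attrs)]
    for xi, yi in zip(X, y):
        for a, v in enumerate(xi):
            counts[a][v][yi] += 1
            totals[a][v] += 1
    return counts, totals, sorted(set(y))

def vdm_distance(x1, x2, counts, totals, classes, q=2):
    """Sum over attributes and classes of |P(c|v1) - P(c|v2)|^q."""
    d = 0.0
    for a, (v1, v2) in enumerate(zip(x1, x2)):
        for c in classes:
            p1 = counts[a][v1][c] / totals[a][v1] if totals[a][v1] else 0.0
            p2 = counts[a][v2][c] / totals[a][v2] if totals[a][v2] else 0.0
            d += abs(p1 - p2) ** q
    return d

def one_sided_sample(X, y, minority_label, seed=0):
    """Keep all minority instances; condense the majority class with 1-NN."""
    random.seed(seed)
    counts, totals, classes = vdm_tables(X, y)
    minority = [(xi, yi) for xi, yi in zip(X, y) if yi == minority_label]
    majority = [(xi, yi) for xi, yi in zip(X, y) if yi != minority_label]
    store = minority + [random.choice(majority)]
    for xi, yi in majority:
        # Classify each majority instance with 1-NN against the current store
        # (VDM distance); misclassified instances are kept, the rest dropped.
        nearest = min(store, key=lambda s: vdm_distance(xi, s[0], counts,
                                                        totals, classes))
        if nearest[1] != yi:
            store.append((xi, yi))
    return store

Under these assumptions, one would run one_sided_sample over the complement/non-complement vectors and train any of the reported classifiers on the condensed set; because the VDM is defined from class-conditional value frequencies, it needs no numeric encoding of the nominal attributes, which is what makes it a natural substitute for the Euclidean distance here.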