Attempting to automatically learn to identify verb complements from natural language corpora, without the help of sophisticated linguistic resources such as grammars, parsers or treebanks, introduces a significant amount of noise into the data. In machine learning terms, where learning from examples is performed on class-labelled feature-value vectors, this noise leads to an imbalanced set of vectors: assuming the class label takes two values (in this work, complement/non-complement), one class (complements) is heavily underrepresented in the data in comparison to the other. To overcome the drop in accuracy when predicting instances of the rare class caused by this disproportion, we balance the learning data by applying one-sided sampling to the training corpus, thereby reducing the number of non-complement instances. This approach has been used in the past in several domains (image processing, medicine, etc.) but not in natural language processing. To identify the examples that are safe to remove, we use the value difference metric, which proves more suitable for nominal attributes such as the ones this work deals with than the Euclidean distance traditionally used in one-sided sampling. We experiment with learning algorithms that have been widely used and whose performance is well known to the machine learning community: Bayesian learners, instance-based learners and decision trees. Additionally, we present and test a variation of Bayesian belief networks, the COr-BBN (Class-oriented Bayesian belief network). Performance improves by up to 22% after balancing the dataset, reaching a 73.7% f-measure for the complement class, having made use only of a phrase chunker and basic morphological information for preprocessing.
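The combination of the value difference metric (VDM) with one-sided sampling can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dataset, labels, and the simplified cleaning rule (drop a majority-class instance whose nearest neighbour under VDM belongs to the minority class) are assumptions for demonstration; the VDM itself follows the standard definition, summing per-attribute differences in conditional class probabilities, which is what makes it suitable for nominal attributes where Euclidean distance is not meaningful.

```python
from collections import defaultdict

def vdm_tables(X, y, q=1):
    """Estimate per-attribute conditional class distributions P(c | a=v)."""
    classes = sorted(set(y))
    tables = []
    for a in range(len(X[0])):
        counts = defaultdict(lambda: defaultdict(int))
        for xi, yi in zip(X, y):
            counts[xi[a]][yi] += 1
        probs = {}
        for v, cc in counts.items():
            total = sum(cc.values())
            probs[v] = {c: cc.get(c, 0) / total for c in classes}
        tables.append(probs)
    return tables, classes, q

def vdm_distance(x1, x2, model):
    """VDM: sum over attributes of |P(c|a=v1) - P(c|a=v2)|^q over classes."""
    tables, classes, q = model
    d = 0.0
    for a, (v1, v2) in enumerate(zip(x1, x2)):
        p1 = tables[a].get(v1, {})
        p2 = tables[a].get(v2, {})
        d += sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) ** q for c in classes)
    return d

def one_sided_sample(X, y, minority_label):
    """Keep all minority instances; drop each majority instance whose
    nearest neighbour (under VDM) is a minority instance -- a simplified
    Tomek-link style cleaning step, used here only as an illustration."""
    model = vdm_tables(X, y)
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == minority_label:
            keep.append(i)
            continue
        nn = min((j for j in range(len(X)) if j != i),
                 key=lambda j: vdm_distance(xi, X[j], model))
        if y[nn] != minority_label:  # safe: closest instance is also majority
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]
```

On a toy dataset of nominal feature vectors, a majority ("non-complement") instance that coincides with minority ("complement") instances is removed, while majority instances surrounded by their own class survive, reducing the imbalance without touching the rare class.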