Imbalanced training sets, in which one or more classes are heavily underrepresented compared to the others, are problematic when classifying new instances that belong to a rare class. In the present article, class imbalance occurs in the datasets of three language learning applications: automatic identification of verb complements, automatic recognition of semantic entities, and learning taxonomic relations. All three applications are tested on Modern Greek text corpora; verb complement identification is also applied to English corpora, and comparative results are presented. The difference in statistical behaviour between the classes may be attributed to the low pre-processing level of the corpora, the automatic nature of the pre-processing phase, and/or the exceptional occurrence of the linguistic information of interest in the data. Two approaches to the problem are investigated: one-sided sampling (OSS) of the dataset and classification using support vector machines (SVMs). OSS removes redundant, noisy and misleading instances of the majority class, thereby reducing the training set size, whereas SVMs, without removing any instances, take into account only examples that lie close to the borderline region between the classes. The better results achieved by OSS in all test cases lead to some interesting observations regarding the impact of individual training instances on classification, depending on their position in the feature-vector space.
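The OSS procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual two-step scheme of one-sided selection (keep all minority examples plus a 1-NN-consistent subset of the majority class, then discard majority examples that form Tomek links, i.e. mutual nearest neighbours with opposite labels), using plain Euclidean distance on numeric feature vectors.

```python
import numpy as np

def one_sided_selection(X, y, minority_label=1, seed=0):
    """Sketch of one-sided selection on a binary dataset.

    Returns the indices of the retained training instances. All minority
    examples are kept; the majority class is reduced in two steps.
    """
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)

    # Step 1 (condensation): start from all minority examples plus one
    # random majority example, then add every majority example that the
    # current 1-NN rule misclassifies -- redundant instances are dropped.
    kept = list(min_idx) + [rng.choice(maj_idx)]
    for i in maj_idx:
        if i in kept:
            continue
        d = np.linalg.norm(X[kept] - X[i], axis=1)
        if y[kept[int(np.argmin(d))]] != y[i]:
            kept.append(i)
    kept = np.array(kept)

    # Step 2 (cleaning): remove majority members of Tomek links, i.e.
    # pairs of mutual nearest neighbours with different labels -- these
    # are the noisy or borderline majority instances.
    Xk, yk = X[kept], y[kept]
    D = np.linalg.norm(Xk[:, None] - Xk[None, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nn = D.argmin(axis=1)
    drop = {i for i in range(len(kept))
            if nn[nn[i]] == i
            and yk[i] != yk[nn[i]]
            and yk[i] != minority_label}
    keep_mask = np.array([i not in drop for i in range(len(kept))])
    return kept[keep_mask]
```

On a toy dataset with a majority cluster, two minority points, and one noisy majority point sitting next to the minority region, the noisy point ends up in a Tomek link with a minority example and is removed, while every minority example survives.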