Imbalanced training sets, in which one or more classes are heavily underrepresented compared to the others, are problematic when classifying new instances that belong to a rare class. In the present article, class imbalance occurs in the datasets of three language learning applications: automatic identification of verb complements, automatic recognition of semantic entities, and learning taxonomic relations. All three applications are tested on Modern Greek text corpora; verb complement identification is also applied to English corpora, and comparative results are presented. The difference in statistical behaviour between the classes may be attributed to the low pre-processing level of the corpora, the automatic nature of the pre-processing phase, and/or the exceptional occurrence of the linguistic information of interest in the data. Two approaches to the problem are investigated: one-sided sampling (OSS) of the dataset and classification using support vector machines (SVMs). OSS removes redundant, noisy and misleading instances of the majority class, thereby reducing the training set size, whereas SVMs, without removing any instances, take into account only examples that lie close to the borderline region between the classes. The better results achieved by OSS in all test cases lead to some interesting observations regarding the impact of individual training instances on classification, depending on their position in the feature-vector space.
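The OSS procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual two-step scheme of one-sided selection (keep all minority examples plus a 1-NN-consistent subset of the majority class, then discard majority examples that form Tomek links, i.e. mutual nearest neighbours with opposite labels), using plain Euclidean distance on numeric feature vectors.

```python
import numpy as np

def one_sided_selection(X, y, minority_label=1, seed=0):
    """Sketch of one-sided selection on a binary dataset.

    Returns the indices of the retained training instances. All minority
    examples are kept; the majority class is reduced in two steps.
    """
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)

    # Step 1 (condensation): start from all minority examples plus one
    # random majority example, then add every majority example that the
    # current 1-NN rule misclassifies -- redundant instances are dropped.
    kept = list(min_idx) + [rng.choice(maj_idx)]
    for i in maj_idx:
        if i in kept:
            continue
        d = np.linalg.norm(X[kept] - X[i], axis=1)
        if y[kept[int(np.argmin(d))]] != y[i]:
            kept.append(i)
    kept = np.array(kept)

    # Step 2 (cleaning): remove majority members of Tomek links, i.e.
    # pairs of mutual nearest neighbours with different labels -- these
    # are the noisy or borderline majority instances.
    Xk, yk = X[kept], y[kept]
    D = np.linalg.norm(Xk[:, None] - Xk[None, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nn = D.argmin(axis=1)
    drop = {i for i in range(len(kept))
            if nn[nn[i]] == i
            and yk[i] != yk[nn[i]]
            and yk[i] != minority_label}
    keep_mask = np.array([i not in drop for i in range(len(kept))])
    return kept[keep_mask]
```

On a toy dataset with a majority cluster, two minority points, and one noisy majority point sitting next to the minority region, the noisy point ends up in a Tomek link with a minority example and is removed, while every minority example survives.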