Dealing with Small, Noisy and Imbalanced Data

Authors:
Adam Przepiórkowski; Michał Marcińczuk; Łukasz Degórski
Affiliations:
Institute of Computer Science, Polish Academy of Sciences, Warsaw, and Institute of Informatics, Warsaw University; Institute of Applied Informatics, Wrocław University of Technology; Institute of Computer Science, Polish Academy of Sciences, Warsaw
Venue:
TSD '08 Proceedings of the 11th international conference on Text, Speech and Dialogue
Year:
2008

Abstract

This paper addresses the task of definition extraction when the training corpus is small, noisy and heavily imbalanced. A previous approach, based on manually constructed shallow grammars, proves hard to beat even with such robust classifiers as SVMs, AdaBoost and simple ensembles of classifiers. However, a linear combination of several such classifiers with the manual grammars significantly improves on the results of the grammars alone.
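
As a rough illustration of what a linear combination of classifier outputs and a grammar decision could look like, the sketch below computes a weighted sum of per-source scores and thresholds it. The abstract does not say how the weights or the decision threshold were obtained, so the function names, the weights, the threshold and the example scores are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def combined_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weighted sum of per-source scores.

    Each entry of `scores` is one source's output for a single sentence,
    e.g. 1.0/0.0 for a grammar match and a value in [0, 1] for a classifier.
    The weights are illustrative assumptions, not values from the paper.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


def is_definition(scores: Sequence[float], weights: Sequence[float],
                  threshold: float = 0.5) -> bool:
    """Label a sentence as a definition if the combined score reaches the threshold."""
    return combined_score(scores, weights) >= threshold


# Hypothetical sources, in order: manual grammar, SVM, AdaBoost.
weights = [0.5, 0.25, 0.25]

# Grammar matches and both classifiers give some support.
print(is_definition([1.0, 0.25, 0.5], weights))

# Grammar does not match and the classifiers are lukewarm.
print(is_definition([0.0, 0.5, 0.5], weights))
```

In a setup like this, the weights and the threshold would typically be tuned on held-out data; with a heavily imbalanced corpus, the threshold is the main lever for trading precision against recall.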