Definition Extraction with Balanced Random Forests

Authors:
Łukasz Kobyliński;Adam Przepiórkowski
Affiliations:
Institute of Computer Science, Warsaw University of Technology, Warszawa, Poland 00-665;Institute of Computer Science, Polish Academy of Sciences, Warszawa, Poland 01-237 and Institute of Informatics, University of Warsaw, Warszawa, Poland 02-097
Venue:
GoTAL '08 Proceedings of the 6th international conference on Advances in Natural Language Processing
Year:
2008

Abstract

We propose a novel machine learning approach to the task of identifying definitions in Polish documents. We take into account the specifics of the problem domain and the characteristics of the available dataset by carefully choosing a classification method and adapting it to highly imbalanced and noisy data. We evaluate the performance of a Random Forest-based classifier in extracting definitional sentences from natural language text and compare the results with previous work.
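
The abstract does not spell out how the classifier is adapted to imbalanced data. As a rough illustration only, the sketch below shows the balanced bootstrap step commonly associated with Balanced Random Forests: each tree is trained on a sample that contains as many majority-class examples as minority-class ones. The function name `balanced_bootstrap`, the toy labels, and the class ratio are assumptions made for this example and are not taken from the paper.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Return indices for one tree's training sample with equal class counts.

    Hypothetical helper: draws, with replacement, the same number of
    examples from every class, where that number is the size of the
    smallest class.
    """
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # The smallest class determines how many examples are drawn per class.
    n_minority = min(len(idxs) for idxs in by_class.values())
    sample = []
    for idxs in by_class.values():
        # Sampling with replacement, as in an ordinary bootstrap.
        sample.extend(rng.choices(idxs, k=n_minority))
    return sample


# Toy data (invented): 1 = definitional sentence (rare), 0 = other (common).
labels = [1] * 10 + [0] * 190
rng = random.Random(0)
sample = balanced_bootstrap(labels, rng)
counts = Counter(labels[i] for i in sample)
```

In a full forest, this sampling would be repeated once per tree, so that different trees see different subsets of the majority class while the rare definitional sentences are represented equally in every sample.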