Adapting SVM for data sparseness and imbalance: A case study in information extraction

  • Authors:
  • Yaoyong Li; Kalina Bontcheva; Hamish Cunningham

  • Affiliations:
  • Department of Computer Science, The University of Sheffield, Regent Court, 211 Portobello, Sheffield S1 4DP, UK. E-mail: yaoyong@dcs.shef.ac.uk, kalina@dcs.shef.ac.uk, hamish@dcs.shef.ac.uk

  • Venue:
  • Natural Language Engineering
  • Year:
  • 2009


Abstract

Support Vector Machines (SVM) have been used successfully in many Natural Language Processing (NLP) tasks. The novel contribution of this paper is in investigating two techniques for making SVM more suitable for language learning tasks. Firstly, we propose an SVM with uneven margins (SVMUM) model to deal with the problem of imbalanced training data. Secondly, SVM active learning is employed in order to alleviate the difficulty in obtaining labelled training data. The algorithms are presented and evaluated on several Information Extraction (IE) tasks, where they achieved better performance than the standard SVM and the SVM with passive learning, respectively. Moreover, by combining SVMUM with the active learning algorithm, we achieve the best reported results on the seminars and jobs corpora, which are benchmark data sets used for evaluation and comparison of machine learning algorithms for IE. In addition, we evaluate the token-based classification framework for IE with three different entity tagging schemes. In comparison to previous methods dealing with the same problems, our methods are both effective and efficient, which are valuable features for real-world applications. Due to the similarity in the formulation of the learning problem for IE and for other NLP tasks, the two techniques are likely to be beneficial in a wide range of applications.
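The two ideas in the abstract can be illustrated with a hedged sketch. This is not the paper's SVMUM formulation: the uneven-margins parameter is approximated here by per-class weights in scikit-learn's `LinearSVC` (a common stand-in for asymmetric margins on imbalanced data), and active learning is shown as simple uncertainty sampling, querying the examples closest to the decision boundary. All names, sizes, and weights below are illustrative assumptions, not values from the paper.

```python
# Sketch: class-weighted linear SVM (rough proxy for uneven margins)
# plus an uncertainty-sampling active-learning loop. Assumes scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Imbalanced toy data (~5% positives), mimicking token-level IE labels
# where most tokens are not part of any entity.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

# Seed pool: a small labelled set guaranteed to contain both classes;
# the rest plays the role of the unlabelled pool.
pos = np.where(y == 1)[0][:5]
neg = np.where(y == 0)[0][:45]
labelled = list(np.concatenate([pos, neg]))
unlabelled = [i for i in range(len(X)) if i not in set(labelled)]

# class_weight shifts the penalty (and effectively the margin) toward
# the rare positive class -- a proxy, not the paper's SVMUM parameter.
clf = LinearSVC(class_weight={0: 1.0, 1: 10.0}, max_iter=10000)

for _ in range(5):  # five active-learning rounds
    clf.fit(X[labelled], y[labelled])
    # Query the 20 unlabelled examples closest to the hyperplane,
    # i.e. those the current model is least certain about.
    margins = np.abs(clf.decision_function(X[unlabelled]))
    query = np.argsort(margins)[:20]
    newly = [unlabelled[i] for i in query]
    labelled.extend(newly)  # simulate an annotator labelling them
    unlabelled = [i for i in unlabelled if i not in set(newly)]

print(len(labelled))  # 50 seed + 5 rounds * 20 queries = 150
```

In a real IE setting the pool would be unannotated documents and each query round would go to a human annotator; the point of the loop is that boundary-adjacent examples are more informative per label than randomly chosen ones, which is what lets active learning reduce annotation cost.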