A two-phase hybrid of semi-supervised and active learning approach for sequence labeling

  • Authors:
  • Hamed Hassanzadeh;Mohammadreza Keyvanpour

  • Affiliations:
  • Young Researchers Club, Qazvin Branch, Islamic Azad University, Qazvin, Iran;Department of Computer Engineering, Alzahra University, Tehran, Iran

  • Venue:
  • Intelligent Data Analysis
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

In recent years, many NLP systems and tasks are developed using machine learning methods. In order to achieve the best performance, these systems are generally trained on a large human annotated corpus. Since annotating such corpora is a very expensive and time-consuming procedure, manually annotating corpora is become one of the significant issues in many text based tasks such as text mining, semantic annotation, Named Entity Recognition and generally Information Extraction. Semi-supervised Learning and Active Learning are two distinct approaches that deal with reduction of labeling costs. Based on their natures, Active and semi-supervised learning can produce better results when they are jointly applied. In this paper we propose a combined Semi-Supervised and Active Learning approach for Sequence Labeling which extremely reduces manual annotation cost in a way that only highly uncertain tokens need to be manually labeled and other sequences and subsequences are labeled automatically. The proposed approach reduces manual annotation cost around 90% compare with a supervised learning and 30% in contrast with a similar fully active learning approach. Conditional Random Field CRF is chosen as the underlying learning model due to its promising performance in many sequence labeling tasks. In addition we proposed a confidence measure based on the model's variance reduction that reaches a considerable accuracy for finding informative samples.