A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension

  • Authors:
  • Hugo Hernault;Danushka Bollegala;Mitsuru Ishizuka

  • Affiliations:
  • The University of Tokyo, Bunkyo-ku, Tokyo, Japan;The University of Tokyo, Bunkyo-ku, Tokyo, Japan;The University of Tokyo, Bunkyo-ku, Tokyo, Japan

  • Venue:
  • EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to beforehand create an extensive training corpus, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of co-occurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Tree-bank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of c.a. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.