A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension

Authors:
Hugo Hernault;Danushka Bollegala;Mitsuru Ishizuka
Affiliations:
The University of Tokyo, Bunkyo-ku, Tokyo, Japan;The University of Tokyo, Bunkyo-ku, Tokyo, Japan;The University of Tokyo, Bunkyo-ku, Tokyo, Japan
Venue:
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Year:
2010

Citing 20
Cited 2

The nature of statistical learning theory

The nature of statistical learning theory
Foundations of statistical natural language processing

Foundations of statistical natural language processing
The Theory and Practice of Discourse Parsing and Summarization

The Theory and Practice of Discourse Parsing and Summarization
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
Latent Semantic Kernels

Journal of Intelligent Information Systems
Support Vector Machines Based on a Semantic Kernel for Text Categorization

IJCNN '00 Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN'00)-Volume 5 - Volume 5
Building a large annotated corpus of English: the penn treebank

Computational Linguistics - Special issue on using large corpora: II
Statistical decision-tree models for parsing

ACL '95 Proceedings of the 33rd annual meeting on Association for Computational Linguistics
An unsupervised approach to recognizing discourse relations

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Sentence level discourse parsing using syntactic and lexical information

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Building a discourse-tagged corpus in the framework of Rhetorical Structure Theory

SIGDIAL '01 Proceedings of the Second SIGdial Workshop on Discourse and Dialogue - Volume 16
NLTK: the Natural Language Toolkit

ETMTNLP '02 Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics - Volume 1
Semantic Kernels for Text Classification Based on Topological Measures of Feature Similarity

ICDM '06 Proceedings of the Sixth International Conference on Data Mining
T2D: Generating Dialogues Between Virtual Agents Automatically from Text

IVA '07 Proceedings of the 7th international conference on Intelligent Virtual Agents
Opinion Mining and Sentiment Analysis

Foundations and Trends in Information Retrieval
A novel discourse parser based on support vector machine classification

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2
Automatic sense prediction for implicit discourse relations in text

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2
Supervised and unsupervised methods in employing discourse relations for improving opinion polarity classification

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1
Recognizing implicit discourse relations in the Penn Discourse Treebank

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1
A sequential model for discourse segmentation

CICLing'10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing

Semi-supervised discourse relation classification with structural learning

CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part I
Text-level discourse parsing with rich linguistic features

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1

Quantified Score

Hi-index	0.00

Visualization

Abstract

Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to beforehand create an extensive training corpus, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of co-occurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Tree-bank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of c.a. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.