Sample selection for statistical parsers: cognitively driven algorithms and evaluation measures

  • Authors:
  • Roi Reichart; Ari Rappoport

  • Affiliations:
  • Hebrew University of Jerusalem; Hebrew University of Jerusalem

  • Venue:
  • CoNLL '09 Proceedings of the Thirteenth Conference on Computational Natural Language Learning
  • Year:
  • 2009

Abstract

Creating large amounts of manually annotated training data for statistical parsers imposes a heavy cognitive load on the human annotator and is thus costly and error-prone. It is therefore highly important to reduce the human effort involved in creating training data without harming parser performance. For constituency parsers, this effort is traditionally evaluated using the total number of constituents (TC) measure, which assumes a uniform cost for each annotated item. In this paper, we introduce novel measures that quantify aspects of the annotator's cognitive effort that are not reflected by the TC measure, and show that they are well established in the psycholinguistic literature. We present a novel parameter-based sample selection approach for creating samples that are good in terms of these measures. We describe methods for global optimisation of lexical parameters of the sample based on a novel optimisation problem, the constrained multiset multicover problem, and for cluster-based sampling according to syntactic parameters. Our methods outperform previously suggested methods in terms of the new measures, while maintaining similar TC performance.