Sample selection for statistical parsers: cognitively driven algorithms and evaluation measures

Authors:
Roi Reichart; Ari Rappoport
Affiliations:
Hebrew University of Jerusalem; Hebrew University of Jerusalem
Venue:
CoNLL '09 Proceedings of the Thirteenth Conference on Computational Natural Language Learning
Year:
2009

Abstract

Creating large amounts of manually annotated training data for statistical parsers imposes a heavy cognitive load on the human annotator and is thus costly and error-prone. It is therefore important to reduce the human effort involved in creating training data without harming parser performance. For constituency parsers, this effort is traditionally evaluated using the total number of constituents (TC) measure, which assumes a uniform cost for each annotated item. In this paper, we introduce novel measures that quantify aspects of the annotator's cognitive effort that the TC measure does not reflect, and show that they are well established in the psycholinguistic literature. We present a novel parameter-based sample selection approach for creating good samples in terms of these measures. We describe methods for global optimisation of the sample's lexical parameters, based on a novel optimisation problem, the constrained multiset multicover problem, and for cluster-based sampling according to syntactic parameters. Our methods outperform previously suggested methods in terms of the new measures while maintaining similar TC performance.