Improving generative statistical parsing with semi-supervised word clustering

Authors:
Marie Candito;Benoît Crabbé
Affiliations:
Université Paris/INRIA (Alpage), Paris;Université Paris/INRIA (Alpage), Paris
Venue:
IWPT '09 Proceedings of the 11th International Conference on Parsing Technologies
Year:
2009

References:
Class-based n-gram models of natural language. Computational Linguistics
Automatic retrieval and clustering of similar words. COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Probabilistic CFG with latent annotations. ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Learning accurate, compact, and interpretable tree annotation. ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Enhancing unlexicalized parsing performance using a wide coverage lexicon, fuzzy tag-set mapping, and EM-HMM-based lexical probabilities. EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
On statistical parsing of French with supervised and semi-supervised strategies. CLAGI '09 Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference

Cited by:
Word representations: a simple and general method for semi-supervised learning. ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Handling unknown words in statistical latent-variable parsing models for Arabic, English and French. SPMRL '10 Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages
Parsing word clusters. SPMRL '10 Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages
Lemmatization and lexicalized statistical parsing of morphologically rich languages: the case of French. SPMRL '10 Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages
Benchmarking of statistical dependency parsers for French. COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Language models as representations for weakly-supervised NLP tasks. CoNLL '11 Proceedings of the Fifteenth Conference on Computational Natural Language Learning
Decreasing lexical data sparsity in statistical syntactic parsing: experiments with named entities. MWE '11 Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World
Multiword expression identification with tree substitution grammars: a parsing tour de force with French. EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
A word clustering approach to domain adaptation: effective parsing of biomedical texts. IWPT '11 Proceedings of the 12th International Conference on Parsing Technologies
French parsing enhanced with a word clustering method based on a syntactic lexicon. SPMRL '11 Proceedings of the Second Workshop on Statistical Parsing of Morphologically Rich Languages
Discriminative strategies to integrate multiword expression recognition and parsing. ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Semi-supervised dependency parsing using lexical affinities. ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Biased representation learning for domain adaptation. EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Word segmentation, unknown-word resolution, and morphological agreement in a Hebrew parsing system. Computational Linguistics
Parsing models for identifying multiword expressions. Computational Linguistics
Combining compound recognition and PCFG-LA parsing with word lattices and conditional random fields. ACM Transactions on Speech and Language Processing (TSLP) - Special issue on multiword expressions: From theory to practice and use, part 2

Abstract

We present a semi-supervised method to improve statistical parsing performance. We focus on the well-known problem of lexical data sparseness and present experiments on word clustering prior to parsing. We combine lexicon-aided morphological clustering, which preserves tagging ambiguity, with unsupervised word clustering trained on a large unannotated corpus. We apply these clusterings to the French Treebank and train a parser with the unlexicalized PCFG-LA algorithm of Petrov et al. (2006). We find a gain in French parsing performance: from a baseline of F1=86.76% to F1=87.37% using morphological clustering, and up to F1=88.29% when unsupervised clustering is applied on top of it. This is the best known score for French probabilistic parsing. These preliminary results are encouraging for statistical parsing of morphologically rich languages and of languages with a small amount of annotated data.
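
As an illustration of what "word clustering prior to parsing" could look like in practice, the following is a minimal sketch that rewrites each token of a sentence before it is handed to a parser. It is not the authors' implementation: the function name `cluster_tokens`, the two lookup tables, the toy word forms, and the cluster identifiers are all hypothetical, and the fallback of leaving unknown tokens unchanged is an assumption made for the example.

```python
from typing import Dict, List


def cluster_tokens(
    tokens: List[str],
    morph_map: Dict[str, str],
    cluster_map: Dict[str, str],
) -> List[str]:
    """Replace each token by a cluster symbol in two steps.

    Step 1 (morphological clustering): map an inflected form to a reduced
    form via morph_map. Step 2 (unsupervised clustering): map the reduced
    form to a cluster identifier via cluster_map. Tokens missing from a
    table are passed through unchanged (an assumption of this sketch).
    """
    clustered = []
    for token in tokens:
        reduced = morph_map.get(token, token)
        clustered.append(cluster_map.get(reduced, reduced))
    return clustered


# Hypothetical toy tables; real ones would come from a lexicon and from
# clusters learned on a large unannotated corpus.
MORPH_MAP = {"mangeait": "mange", "mangent": "mange"}
CLUSTER_MAP = {"mange": "C0110", "boit": "C0110", "pomme": "C1011"}

sentence = ["il", "mangeait", "une", "pomme"]
clustered = cluster_tokens(sentence, MORPH_MAP, CLUSTER_MAP)
print(clustered)
```

In a setup like this, the same replacement would be applied to the treebank used for training and to the sentences to be parsed, so that the parser sees fewer distinct terminal symbols.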