Combining compound recognition and PCFG-LA parsing with word lattices and conditional random fields

Authors:
Matthieu Constant; Joseph Le Roux; Anthony Sigogne
Affiliations:
Université Paris-Est, LIGM, CNRS; Université Paris-Nord, LIPN, CNRS; Université Paris-Est, LIGM, CNRS
Venue:
ACM Transactions on Speech and Language Processing (TSLP) - Special issue on multiword expressions: From theory to practice and use, part 2
Year:
2013

Abstract

Integrating compounds into a parsing procedure has been shown to improve accuracy in an artificial setting where such expressions are perfectly preidentified. This article evaluates two empirical strategies for incorporating such multiword units in a realistic PCFG-LA parsing context: (1) a grammar that includes compound recognition, by means of specialized annotation schemes for compounds; (2) a state-of-the-art discriminative compound prerecognizer that integrates endogenous and exogenous features. We show how these two strategies can be combined using word lattices that represent the possible lexical analyses generated by the recognizer. The proposed systems display significant gains in multiword recognition and often in standard parsing accuracy. Moreover, an oracle analysis shows that this combined strategy opens promising new research directions.
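
To make the idea of a word lattice over lexical analyses concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `build_lattice` and `segmentations`, the French example sentence, and the candidate compound span are all hypothetical. In the article the candidates come from a discriminative recognizer and the parser chooses among the paths; neither of those components is modeled here.

```python
from typing import Dict, List, Tuple

# An edge covers token positions [start, end) and carries one lexical unit.
Edge = Tuple[int, int, str]

def build_lattice(words: List[str], compounds: List[Tuple[int, int]]) -> List[Edge]:
    """One edge per simple word, plus one edge per candidate compound span.

    `compounds` is a hypothetical recognizer output: (start, end) spans.
    """
    edges: List[Edge] = [(i, i + 1, w) for i, w in enumerate(words)]
    for start, end in compounds:
        # A compound is represented as a single unit joining its words.
        edges.append((start, end, "_".join(words[start:end])))
    return edges

def segmentations(edges: List[Edge], n: int) -> List[List[str]]:
    """Enumerate every path from position 0 to position n in the lattice."""
    by_start: Dict[int, List[Tuple[int, str]]] = {}
    for start, end, unit in edges:
        by_start.setdefault(start, []).append((end, unit))

    def walk(pos: int) -> List[List[str]]:
        if pos == n:
            return [[]]
        paths: List[List[str]] = []
        for end, unit in by_start.get(pos, []):
            for rest in walk(end):
                paths.append([unit] + rest)
        return paths

    return walk(0)

# Illustrative input (assumed, not taken from the article).
words = ["il", "mange", "une", "pomme", "de", "terre"]
candidates = [(3, 6)]  # "pomme de terre" proposed as one compound
lattice = build_lattice(words, candidates)
paths = segmentations(lattice, len(words))
for path in paths:
    print(" ".join(path))
```

Each path is one possible lexical analysis of the sentence. A parser that takes the lattice as input can defer the choice between the compound reading and the word-by-word reading until syntactic evidence is available, which is the motivation for combining the two strategies.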