Parsing word clusters

Authors:
Marie Candito;Djamé Seddah
Affiliations:
Alpage (Université Paris), Paris, France;Alpage (Université Paris), Paris, France and Université Paris-Sorbonne, Paris, France
Venue:
SPMRL '10 Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages
Year:
2010

Citing 12
Cited 8

Class-based n-gram models of natural language

Computational Linguistics
Large Margin Classification Using the Perceptron Algorithm

Machine Learning - The Eleventh Annual Conference on computational Learning Theory
Statistical decision-tree models for parsing

ACL '95 Proceedings of the 33rd annual meeting on Association for Computational Linguistics
Accurate unlexicalized parsing

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Probabilistic CFG with latent annotations

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Learning accurate, compact, and interpretable tree annotation

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Enhancing unlexicalized parsing performance using a wide coverage lexicon, fuzzy tag-set mapping, and EM-HMM-based lexical probabilities

EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Scalable discriminative parsing for German

IWPT '09 Proceedings of the 11th International Conference on Parsing Technologies
Improving generative statistical parsing with semi-supervised word clustering

IWPT '09 Proceedings of the 11th International Conference on Parsing Technologies
Cross parser evaluation and tagset variation: a French treebank study

IWPT '09 Proceedings of the 11th International Conference on Parsing Technologies
Clustering words by syntactic similarity improves dependency parsing of predicate-argument structures

IWPT '09 Proceedings of the 11th International Conference on Parsing Technologies
Lemmatization and lexicalized statistical parsing of morphologically rich languages: the case of French

SPMRL '10 Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages

Statistical parsing of morphologically rich languages (SPMRL): what, how and whither

SPMRL '10 Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages
Benchmarking of statistical dependency parsers for French

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Improving dependency parsing with semantic classes

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
A word clustering approach to domain adaptation: effective parsing of biomedical texts

IWPT '11 Proceedings of the 12th International Conference on Parsing Technologies
French parsing enhanced with a word clustering method based on a syntactic lexicon

SPMRL '11 Proceedings of the Second Workshop on Statistical Parsing of Morphologically Rich Languages
Semi-supervised dependency parsing using lexical affinities

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Data-driven parsing using probabilistic linear context-free rewriting systems

Computational Linguistics
Parsing models for identifying multiword expressions

Computational Linguistics

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present and discuss experiments in statistical parsing of French, where terminal forms used during training and parsing are replaced by more general symbols, particularly clusters of words obtained through un-supervised linear clustering. We build on the work of Candito and Crabbé (2009) who proposed to use clusters built over slightly coarsened French inflected forms. We investigate the alternative method of building clusters over lemma/part-of-speech pairs, using a raw corpus automatically tagged and lemmatized. We find that both methods lead to comparable improvement over the baseline (we obtain F1=86.20% and F1=86.21% respectively, compared to a baseline of F1=84.10%). Yet, when we replace gold lemma/POS pairs with their corresponding cluster, we obtain an upper bound (F1=87.80) that suggests room for improvement for this technique, should tag-ging/lemmatisation performance increase for French. We also analyze the improvement in performance for both techniques with respect to word frequency. We find that replacing word forms with clusters improves attachment performance for words that are originally either unknown or low-frequency, since these words are replaced by cluster symbols that tend to have higher frequencies. Furthermore, clustering also helps significantly for medium to high frequency words, suggesting that training on word clusters leads to better probability estimates for these words.