Smoothing for bracketing induction
IJCAI '13: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence
The Constituent-Context Model (CCM) is an effective generative model for grammar induction, whose aim is to induce hierarchical syntactic structure from natural text. The CCM simply defines a multinomial distribution over constituents, which leads to a severe data-sparseness problem because long constituents are unlikely to recur in unseen data. This paper proposes a Bayesian method for constituent smoothing that places two kinds of prior distributions over constituents: a Dirichlet prior and a Pitman-Yor process (PYP) prior. The Dirichlet prior functions as additive smoothing, while the PYP prior functions as back-off smoothing. Furthermore, a modified CCM is proposed that differentiates left constituents from right constituents in binary-branching trees. Experiments show that both the proposed Bayesian smoothing method and the modified CCM are effective, and that combining them matches or significantly improves the state-of-the-art performance of grammar induction on standard treebanks of several languages.
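To make the two smoothing regimes concrete, the sketch below contrasts them on toy constituent counts. This is not the authors' implementation: the data, the uniform base distribution, and the single-parameter PYP approximation (zero concentration, one "table" per observed type) are illustrative assumptions only.

```python
from collections import Counter

def additive_smoothing(counts, vocab_size, alpha=1.0):
    """Dirichlet-style additive smoothing: add alpha pseudo-counts to
    every possible event, so unseen constituents keep nonzero mass."""
    total = sum(counts.values()) + alpha * vocab_size
    return lambda span: (counts.get(span, 0) + alpha) / total

def backoff_prob(span, counts, base_prob, discount=0.5):
    """PYP-flavoured back-off (crude sketch): subtract a discount from
    each observed count and hand the freed mass to a base distribution,
    so rare or unseen constituents fall back on simpler statistics.
    Assumes zero concentration and one table per observed type."""
    total = sum(counts.values())
    if total == 0:
        return base_prob(span)
    freed = discount * len(counts)  # mass redistributed to the base
    c = counts.get(span, 0)
    return (max(c - discount, 0.0) + freed * base_prob(span)) / total

# Hypothetical constituent counts keyed by POS-tag yields.
counts = Counter({("DT", "NN"): 8, ("JJ", "NN"): 4})

p_add = additive_smoothing(counts, vocab_size=10, alpha=0.5)
uniform_base = lambda span: 0.1  # toy base over 10 possible spans
p_seen = backoff_prob(("DT", "NN"), counts, uniform_base)
p_unseen = backoff_prob(("VB", "RB"), counts, uniform_base)
```

Under both estimators an unseen span such as `("VB", "RB")` receives a small but nonzero probability, which is exactly what the CCM's unsmoothed multinomial denies to long constituents absent from training data.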