Natural language grammar induction with a generative constituent-context model

  • Authors: Dan Klein; Christopher D. Manning

  • Affiliations: Computer Science Department, Stanford University, 353 Serra Mall, Room 418, Stanford, CA 94305-9040, USA (both authors)

  • Venue: Pattern Recognition
  • Year: 2005

Abstract

We present a generative probabilistic model for the unsupervised learning of hierarchical natural language syntactic structure. Unlike most previous work, we do not learn a context-free grammar, but rather induce a distributional model of constituents which explicitly relates constituent yields and their linear contexts. Parameter search with EM produces higher quality analyses for human language data than those previously exhibited by unsupervised systems, giving the best published unsupervised parsing results on the ATIS corpus. Experiments on Penn Treebank sentences of comparable length show an even higher constituent F1 of 71% on non-trivial brackets. We compare distributionally induced and actual part-of-speech tags as input data, and examine extensions to the basic model. We discuss errors made by the system, compare the system to previous models, and discuss upper bounds, lower bounds, and stability for this task.
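To make the yield/context idea concrete, here is a minimal brute-force sketch of an EM loop in that spirit: every non-trivial span of a POS sequence is scored by multinomials over its yield (the tag subsequence) and its linear context (the tags immediately to its left and right), as a constituent if it is bracketed and as a distituent otherwise. Everything below is an illustrative assumption rather than the authors' implementation: the toy corpus, the smoothing floor, and the exhaustive enumeration of binary bracketings (the paper uses an efficient dynamic program over much larger data).

```python
# Toy constituent-context-style EM: brute-force over binary bracketings.
from collections import defaultdict
import math

def bracketings(i, j):
    """Yield the span sets of all binary trees over positions [i, j)."""
    if j - i <= 1:
        yield frozenset()
        return
    for k in range(i + 1, j):
        for left in bracketings(i, k):
            for right in bracketings(k, j):
                yield frozenset({(i, j)}) | left | right

def span_features(tags, a, b):
    """Yield (tag subsequence) and context (left tag, right tag) of span [a, b)."""
    yld = " ".join(tags[a:b])
    left = tags[a - 1] if a > 0 else "<s>"
    right = tags[b] if b < len(tags) else "</s>"
    return yld, left + "_" + right

def log_score(tags, spans, p_yield, p_context):
    """Log-probability of one bracketing: each non-trivial span is scored as a
    constituent ("C") if bracketed, otherwise as a distituent ("D")."""
    logp, n = 0.0, len(tags)
    for a in range(n):
        for b in range(a + 2, n + 1):
            label = "C" if (a, b) in spans else "D"
            yld, ctx = span_features(tags, a, b)
            logp += math.log(p_yield[label].get(yld, 1e-6))   # ad-hoc floor
            logp += math.log(p_context[label].get(ctx, 1e-6))
    return logp

def normalize(counts):
    return {label: {k: v / (sum(tbl.values()) or 1.0) for k, v in tbl.items()}
            for label, tbl in counts.items()}

def em_step(corpus, p_yield, p_context):
    """One EM pass: posterior-weight all bracketings, re-estimate multinomials."""
    y_counts = {"C": defaultdict(float), "D": defaultdict(float)}
    c_counts = {"C": defaultdict(float), "D": defaultdict(float)}
    for tags in corpus:
        cands = list(bracketings(0, len(tags)))
        logs = [log_score(tags, s, p_yield, p_context) for s in cands]
        m = max(logs)
        weights = [math.exp(l - m) for l in logs]
        z = sum(weights)
        n = len(tags)
        for spans, w in zip(cands, weights):
            w /= z
            for a in range(n):
                for b in range(a + 2, n + 1):
                    label = "C" if (a, b) in spans else "D"
                    yld, ctx = span_features(tags, a, b)
                    y_counts[label][yld] += w
                    c_counts[label][ctx] += w
    return normalize(y_counts), normalize(c_counts)

# Hypothetical toy POS corpus; empty initial tables fall back to the smoothing
# floor, giving a uniform posterior over bracketings on the first pass.
corpus = [["DT", "NN", "VBD", "DT", "NN"],
          ["DT", "JJ", "NN", "VBD"],
          ["NNS", "VBD", "DT", "NN"]]
p_yield = {"C": {}, "D": {}}
p_context = {"C": {}, "D": {}}
for _ in range(10):
    p_yield, p_context = em_step(corpus, p_yield, p_context)
```

The exhaustive enumeration here grows with the Catalan numbers, so it only serves to show the shape of the E-step; a practical system would sum over bracketings with an inside-outside-style dynamic program rather than listing them.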