Identifying patterns for unsupervised grammar induction

Authors:
Jesús Santamaría;Lourdes Araujo
Affiliations:
U. Nacional de Educación a Distancia, Madrid, Spain;U. Nacional de Educación a Distancia, Madrid, Spain
Venue:
CoNLL '10 Proceedings of the Fourteenth Conference on Computational Natural Language Learning
Year:
2010

Citing 6
Cited 3

Building a large annotated corpus of English: the penn treebank

Computational Linguistics - Special issue on using large corpora: II
Experiments in parallel-text based grammar induction

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Unsupervised grammar induction by distribution and attachment

CoNLL-X '06 Proceedings of the Tenth Conference on Computational Natural Language Learning
Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Unsupervised multilingual grammar induction

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1
Natural language grammar induction with a generative constituent-context model

Pattern Recognition

Reducing the size of the representation for the uDOP-estimate

EMNLP '11 Proceedings of the First Workshop on Unsupervised Learning in NLP
Semi-supervised constituent grammar induction based on text chunking information

CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I
Semantic separator learning and its applications in unsupervised Chinese text parsing

Frontiers of Computer Science: Selected Publications from Chinese Universities

Quantified Score

Hi-index	0.01

Visualization

Abstract

This paper describes a new method for unsupervised grammar induction based on the automatic extraction of certain patterns in the texts. Our starting hypothesis is that there exist some classes of words that function as separators, marking the beginning or the end of new constituents. Among these separators we distinguish those which trigger new levels in the parse tree. If we are able to detect these separators we can follow a very simple procedure to identify the constituents of a sentence by taking the classes of words between separators. This paper is devoted to describe the process that we have followed to automatically identify the set of separators from a corpus only annotated with Part-of-Speech (POS) tags. The proposed approach has allowed us to improve the results of previous proposals when parsing sentences from the Wall Street Journal corpus.