Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks

Authors:
Ruth O'Donovan;Michael Burke;Aoife Cahill;Josef Van Genabith;Andy Way
Affiliations:
-;-;-;-;-
Venue:
Computational Linguistics
Year:
2005

Citing 19
Cited 7

Natural language parsing as statistical pattern recognition

Natural language parsing as statistical pattern recognition
Structural ambiguity and lexical relations

Computational Linguistics - Special issue on using large corpora: I
From grammar to lexicon: unsupervised learning of lexical syntax

Computational Linguistics - Special issue on using large corpora: II
Automatic extraction of subcategorization from corpora

ANLC '97 Proceedings of the fifth conference on Applied natural language processing
Three generative, lexicalised models for statistical parsing

ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
How verb subcategorization frequencies are affected by corpus choice

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Compacting the Penn Treebank grammar

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
The derivation of a grammatically indexed lexicon from the Longman Dictionary of Contemporary English

ACL '87 Proceedings of the 25th annual meeting on Association for Computational Linguistics
Automatic acquisition of a large subcategorization dictionary from corpora

ACL '93 Proceedings of the 31st annual meeting on Association for Computational Linguistics
Statistical decision-tree models for parsing

ACL '95 Proceedings of the 33rd annual meeting on Association for Computational Linguistics
Comlex Syntax: building a computational lexicon

COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 1
Automatic extraction of subcategorization frames for Czech

COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
Building a large-scale annotated Chinese corpus

COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
The Comlex Syntax project: the first year

HLT '94 Proceedings of the workshop on Human Language Technology
The Penn Treebank: annotating predicate argument structure

HLT '94 Proceedings of the workshop on Human Language Technology
Long-distance dependency resolution in automatically acquired wide-coverage PCFG-based LFG approximations

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Large-scale induction and evaluation of lexical resources from the Penn-II treebank

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Tree-bank grammars

AAAI'96 Proceedings of the thirteenth national conference on Artificial intelligence - Volume 2
Corpus-Oriented grammar development for acquiring a head-driven phrase structure grammar from the penn treebank

IJCNLP'04 Proceedings of the First international joint conference on Natural Language Processing

Morphology-syntax interface for Turkish LFG

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Creating a CCGbank and a wide-coverage CCG lexicon for German

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank

Computational Linguistics
IRASubcat, a highly customizable, language independent tool for the acquisition of verbal subcategorization information from corpus

YIWCALA '10 Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas
Acquisition of unknown word paradigms for large-scale grammars

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Analysis of the difficulties in Chinese deep parsing

IWPT '11 Proceedings of the 12th International Conference on Parsing Technologies
French parsing enhanced with a word clustering method based on a syntactic lexicon

SPMRL '11 Proceedings of the Second Workshop on Statistical Parsing of Morphologically Rich Languages

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present a methodology for extracting subcategorization frames based on an automatic lexical-functional grammar (LFG) f-structure annotation algorithm for the Penn-II and Penn-III Treebanks. We extract syntactic-function-based subcategorization frames (LFG semantic forms) and traditional CFG category-based subcategorization frames as well as mixed function/category-based frames, with or without preposition information for obliques and particle information for particle verbs. Our approach associates probabilities with frames conditional on the lemma, distinguishes between active and passive frames, and fully reflects the effects of long-distance dependencies in the source data structures. In contrast to many other approaches, ours does not predefine the subcategorization frame types extracted, learning them instead from the source data. Including particles and prepositions, we extract 21,005 lemma frame types for 4,362 verb lemmas, with a total of 577 frame types and an average of 4.8 frame types per verb. We present a large-scale evaluation of the complete set of forms extracted against the full COMLEX resource. To our knowledge, this is the largest and most complete evaluation of subcategorization frames acquired automatically for English.