How verb subcategorization frequencies are affected by corpus choice

Authors:
Douglas Roland;Daniel Jurafsky
Affiliations:
University of Colorado, Boulder, CO;University of Colorado, Boulder, CO
Venue:
COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Year:
1998

Citing 10
Cited 10

Using register-diversified corpora for general language studies

Computational Linguistics - Special issue on using large corpora: II
Building a large annotated corpus of English: the penn treebank

Computational Linguistics - Special issue on using large corpora: II
Three generative, lexicalised models for statistical parsing

ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
Automatic extraction of subcorpora based on subcategorization frames from a part-of-speech tagged corpus

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
Automatic acquisition of a large subcategorization dictionary from corpora

ACL '93 Proceedings of the 31st annual meeting on Association for Computational Linguistics
A new statistical parser based on bigram lexical dependencies

ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics
Disambiguation of super parts of speech (or supertags): almost parsing

COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 1
One sense per discourse

HLT '91 Proceedings of the workshop on Speech and Natural Language
The Penn Treebank: annotating predicate argument structure

HLT '94 Proceedings of the workshop on Human Language Technology
Statistical parsing with a context-free grammar and word statistics

AAAI'97/IAAI'97 Proceedings of the fourteenth national conference on artificial intelligence and ninth conference on Innovative applications of artificial intelligence

Automatic verb classification using distributions of grammatical features

EACL '99 Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics
Acquiring lexical generalizations from corpora: a case study for diathesis alternations

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks

Computational Linguistics
Verb subcategorization frequency differences between business-news and balanced corpora: the role of verb sense

WCC '00 Proceedings of the workshop on Comparing corpora - Volume 9
Parsing and subcategorization data

COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
Parsing and subcategorization data

COLING ACL '06 Proceedings of the 21st International Conference on computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Verb subcategorization frequency differences between business-news and balanced corpora: the role of verb sense

CompareCorpora '00 Proceedings of the Workshop on Comparing Corpora
Robust extraction of subcategorization data from spoken language

Parsing '05 Proceedings of the Ninth International Workshop on Parsing Technology
Exploring variations across biomedical subdomains

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Learning syntactic verb frames using graphical models

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1

Quantified Score

Hi-index	0.00

Visualization

Abstract

The probabilistic relation between verbs and their arguments plays an important role in modern statistical parsers and supertaggers, and in psychological theories of language processing. But these probabilities are computed in very different ways by the two sets of researchers. Computational linguists compute verb subcategorization probabilities from large corpora while psycholinguists compute them from psychological studies (sentence production and completion tasks). Recent studies have found differences between corpus frequencies and psycholinguistic measures. We analyze subcategorization frequencies from four different corpora: psychological sentence production data (Connine et al. 1984), written text (Brown and WSJ), and telephone conversation data (Switchboard). We find two different sources for the differences. Discourse influence is a result of how verb use is affected by different discourse types such as narrative, connected discourse, and single sentence productions. Semantic influence is a result of different corpora using different senses of verbs, which have different subcategorization frequencies. We conclude that verb sense and discourse type play an important role in the frequencies observed in different experimental and corpus based sources of verb subcategorization frequencies.