How verb subcategorization frequencies are affected by corpus choice

  • Authors:
  • Douglas Roland;Daniel Jurafsky

  • Affiliations:
  • University of Colorado, Boulder, CO;University of Colorado, Boulder, CO

  • Venue:
  • COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
  • Year:
  • 1998

Quantified Score

Hi-index 0.00

Visualization

Abstract

The probabilistic relation between verbs and their arguments plays an important role in modern statistical parsers and supertaggers, and in psychological theories of language processing. But these probabilities are computed in very different ways by the two sets of researchers. Computational linguists compute verb subcategorization probabilities from large corpora while psycholinguists compute them from psychological studies (sentence production and completion tasks). Recent studies have found differences between corpus frequencies and psycholinguistic measures. We analyze subcategorization frequencies from four different corpora: psychological sentence production data (Connine et al. 1984), written text (Brown and WSJ), and telephone conversation data (Switchboard). We find two different sources for the differences. Discourse influence is a result of how verb use is affected by different discourse types such as narrative, connected discourse, and single sentence productions. Semantic influence is a result of different corpora using different senses of verbs, which have different subcategorization frequencies. We conclude that verb sense and discourse type play an important role in the frequencies observed in different experimental and corpus based sources of verb subcategorization frequencies.