Variable selection in logistic regression: the British English dative alternation

  • Authors: Daphne Theijssen
  • Affiliations: Centre for Language Studies, Radboud University Nijmegen, Nijmegen, The Netherlands
  • Venue: ESSLLI'08/09 Proceedings of the 2008 international conference on Interfaces: explorations in logic, language and computation
  • Year: 2008

Abstract

This paper addresses the problem of selecting the 'optimal' variable subset in a logistic regression model for a medium-sized data set. As a case study, we take the British English dative alternation, in which speakers and writers can choose between two equally grammatical syntactic constructions to express the same meaning. Using 29 explanatory variables taken from the literature, we build two types of models: one with the verb sense included as a random effect, and one without a random effect. For each type, we build three models using different selection procedures: (1) including all variables and keeping the significant ones, (2) successively adding the most predictive variable (forward selection), and (3) successively removing the least predictive variable (backward elimination). Because the six approaches lead to six different variable selections, and thus six different models, we conclude that selecting the 'best' model requires a substantial amount of linguistic expertise.
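
Forward selection and backward elimination, as named in the abstract, can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions: the data are synthetic (four hypothetical variables, of which only variables 0 and 2 influence the outcome), the model has no random effect for verb sense, the fit uses plain gradient ascent, and AIC serves as the stopping criterion because the abstract does not say which criterion was used.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(X, y, cols, steps=150, lr=1.0):
    """Fit intercept + chosen columns by gradient ascent.

    Returns the (approximately maximised) log-likelihood.
    """
    n = len(y)
    rows = [[1.0] + [row[c] for c in cols] for row in X]
    w = [0.0] * (len(cols) + 1)  # w[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * len(w)
        for r, t in zip(rows, y):
            err = t - sigmoid(sum(wi * ri for wi, ri in zip(w, r)))
            for j, rj in enumerate(r):
                grad[j] += err * rj
        w = [wi + lr * g / n for wi, g in zip(w, grad)]
    ll = 0.0
    for r, t in zip(rows, y):
        z = sum(wi * ri for wi, ri in zip(w, r))
        # log-likelihood term t*z - log(1 + exp(z)), computed stably
        ll += t * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return ll


def aic(X, y, cols):
    # Assumed criterion: AIC = 2k - 2*logL, with k counting the intercept.
    return 2 * (len(cols) + 1) - 2 * log_likelihood(X, y, cols)


def forward_selection(X, y, n_vars):
    """Start empty; repeatedly add the variable that lowers AIC most."""
    selected, best = [], aic(X, y, [])
    while len(selected) < n_vars:
        candidates = [(aic(X, y, selected + [c]), c)
                      for c in range(n_vars) if c not in selected]
        score, c = min(candidates)
        if score >= best:  # no candidate improves the model
            break
        selected.append(c)
        best = score
    return sorted(selected)


def backward_elimination(X, y, n_vars):
    """Start full; repeatedly drop the variable whose removal lowers AIC most."""
    selected = list(range(n_vars))
    best = aic(X, y, selected)
    while selected:
        candidates = [(aic(X, y, [s for s in selected if s != c]), c)
                      for c in selected]
        score, c = min(candidates)
        if score >= best:  # every removal makes the model worse
            break
        selected.remove(c)
        best = score
    return sorted(selected)


# Synthetic data (an assumption for illustration, not the dative data set).
random.seed(0)
N_VARS = 4
X = [[random.gauss(0, 1) for _ in range(N_VARS)] for _ in range(200)]
y = [1 if random.random() < sigmoid(1.5 * row[0] - 1.5 * row[2]) else 0
     for row in X]

forward = forward_selection(X, y, N_VARS)
backward = backward_elimination(X, y, N_VARS)
print("forward selection keeps:", forward)
print("backward elimination keeps:", backward)
```

Both procedures are greedy, so nothing guarantees that they end at the same subset. On data with many correlated predictors they can diverge, which is the behaviour the abstract reports for its six models.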