Glen, Glenda or Glendale: unsupervised and semi-supervised learning of English noun gender

  • Authors:
  • Shane Bergsma; Dekang Lin; Randy Goebel

  • Affiliations:
  • University of Alberta, Alberta, Canada; Google, Inc., Mountain View, California; University of Alberta, Alberta, Canada

  • Venue:
  • CoNLL '09 Proceedings of the Thirteenth Conference on Computational Natural Language Learning
  • Year:
  • 2009

Abstract

English pronouns like he and they reliably reflect the gender and number of the entities to which they refer. Pronoun resolution systems can use this fact to filter out noun candidates that do not agree with the pronoun gender. Indeed, broad-coverage models of noun gender have proved to be the most important source of world knowledge in automatic pronoun resolution systems. Previous approaches predict gender by counting the co-occurrence of nouns with pronouns of each gender class. While this provides useful statistics for frequent nouns, many infrequent nouns cannot be classified using this method. Rather than using co-occurrence information directly, we use it to automatically annotate training examples for a large-scale discriminative gender model. Our model collectively classifies all occurrences of a noun in a document using a wide variety of contextual, morphological, and categorical gender features. By leveraging large volumes of unlabeled data, our full semi-supervised system reduces error by 50% over the existing state of the art in gender classification.
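
The count-based approach that the abstract attributes to previous work can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that (noun, pronoun) pairs have already been extracted from text by some upstream step, and the `PRONOUN_CLASS` mapping, the four class names, the function names, and the `min_count` threshold are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from pronouns to gender/number classes (an assumption,
# not taken from the paper).
PRONOUN_CLASS = {
    "he": "masculine", "him": "masculine", "his": "masculine", "himself": "masculine",
    "she": "feminine", "her": "feminine", "hers": "feminine", "herself": "feminine",
    "it": "neutral", "its": "neutral", "itself": "neutral",
    "they": "plural", "them": "plural", "their": "plural", "themselves": "plural",
}


def count_cooccurrences(pairs):
    """Tally, for each noun, how often it co-occurs with pronouns of each class."""
    counts = defaultdict(Counter)
    for noun, pronoun in pairs:
        gender = PRONOUN_CLASS.get(pronoun.lower())
        if gender is not None:  # ignore tokens that are not known pronouns
            counts[noun.lower()][gender] += 1
    return counts


def predict_gender(counts, noun, min_count=3):
    """Return the most frequent class for a noun, or None if evidence is too sparse."""
    noun_counts = counts.get(noun.lower())
    if not noun_counts or sum(noun_counts.values()) < min_count:
        return None  # infrequent noun: the counts alone cannot classify it
    return noun_counts.most_common(1)[0][0]


# Toy, made-up pairs; in practice these would come from a large corpus.
pairs = [
    ("Glenda", "she"), ("Glenda", "her"), ("Glenda", "she"),
    ("Glen", "he"), ("Glen", "his"), ("Glen", "him"),
    ("Glendale", "it"),
]
counts = count_cooccurrences(pairs)
print(predict_gender(counts, "Glenda"))
print(predict_gender(counts, "Glen"))
print(predict_gender(counts, "Glendale"))
```

In this toy data, "Glendale" is seen only once, so the sketch declines to classify it. This mirrors the limitation the abstract describes for infrequent nouns, which motivates using the counts to label training examples for a discriminative model instead of using them directly.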