Error-driven pruning of Treebank grammars for base noun phrase identification

  • Authors:
  • Claire Cardie;David Pierce

  • Affiliations:
  • Cornell University, Ithaca, NY;Cornell University, Ithaca, NY

  • Venue:
  • COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
  • Year:
  • 1998

Quantified Score

Hi-index 0.00

Visualization

Abstract

Finding simple, non-recursive, base noun phrases is an important subtask for many natural language processing applications. While previous empirical methods for base NP identification have been rather complex, this paper instead proposes a very simple algorithm that is tailored to the relative simplicity of the task. In particular, we present a corpus-based approach for finding base NPs by matching part-of-speech tag sequences. The training phase of the algorithm is based on two successful techniques: first the base NP grammar is read from a "treebank" corpus; then the grammar is improved by selecting rules with high "benefit" scores. Using this simple algorithm with a naive heuristic for matching rules, we achieve surprising accuracy in an evaluation on the Penn Treebank Wall Street Journal.