Grammatical analysis by computer of the Lancaster-Oslo/Bergen (LOB) corpus of British English texts

  • Authors:
  • Andrew David Beale

  • Affiliations:
  • University of Lancaster, Lancaster, England

  • Venue:
  • ACL '85 Proceedings of the 23rd annual meeting on Association for Computational Linguistics
  • Year:
  • 1985

Abstract

Research has been under way at the Unit for Computer Research on the English Language at the University of Lancaster, England, to develop a suite of computer programs that provide a detailed grammatical analysis of the LOB corpus, a collection of about 1 million words of British English texts available in machine-readable form.

The first phase of the project, completed in September 1983, produced a grammatically annotated version of the corpus in which each word token carries a tag showing its word class. Over 93 per cent of the word tags were correctly selected by using a matrix of tag-pair probabilities. This figure was raised by a further 3 per cent by retagging problematic strings of words prior to disambiguation and by altering the probability weightings for sequences of three tags. The remaining 3 to 4 per cent were corrected by a human post-editor.

The system was originally designed to run in batch mode over the corpus, but we have recently modified the procedures to run interactively on sample sentences typed in by a user at a terminal. We are currently extending the word tag set and improving the word tagging procedures to further reduce manual intervention. A similar probabilistic system is being developed for phrase and clause tagging.
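The idea of selecting word tags with a matrix of tag-pair probabilities can be illustrated with a minimal Python sketch. It is not the project's actual program: the candidate tag lists, the probability values, the fallback weight and the function name `best_tag_sequence` are all hypothetical, and the sketch simply enumerates every possible tag sequence for a short sentence and keeps the one whose adjacent tag pairs have the highest product of probabilities.

```python
from itertools import product

# Hypothetical candidate tags for each word (illustrative values only).
CANDIDATES = {
    "the": ["AT"],
    "run": ["NN", "VB"],
    "ended": ["VBD"],
}

# Hypothetical tag-pair matrix: TRANSITION[(prev, nxt)] is an assumed
# probability that tag `nxt` follows tag `prev`. The numbers are invented.
TRANSITION = {
    ("AT", "NN"): 0.6,
    ("AT", "VB"): 0.01,
    ("NN", "VBD"): 0.3,
    ("VB", "VBD"): 0.02,
}

# Assumed small fallback weight for tag pairs missing from the matrix.
DEFAULT = 1e-4


def best_tag_sequence(words):
    """Return the tag sequence with the highest product of pair probabilities."""
    options = [CANDIDATES[w] for w in words]
    best, best_score = None, -1.0
    # Exhaustive search is fine for a short example; it grows exponentially
    # with the number of ambiguous words.
    for seq in product(*options):
        score = 1.0
        for prev, nxt in zip(seq, seq[1:]):
            score *= TRANSITION.get((prev, nxt), DEFAULT)
        if score > best_score:
            best, best_score = list(seq), score
    return best


print(best_tag_sequence(["the", "run", "ended"]))
```

Here the ambiguous word "run" is resolved as a noun because the pair sequence AT NN VBD scores higher than AT VB VBD under the assumed weights. A reweighting for sequences of three tags, as mentioned in the abstract, would need an additional adjustment step that this sketch leaves out.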