Grammatical analysis by computer of the Lancaster-Oslo/Bergen (LOB) corpus of British English texts

Authors:
Andrew David Beale
Affiliations:
University of Lancaster, Lancaster, England
Venue:
ACL '85 Proceedings of the 23rd annual meeting on Association for Computational Linguistics
Year:
1985

Abstract

Research has been under way at the Unit for Computer Research on the English Language at the University of Lancaster, England, to develop a suite of computer programs that provide a detailed grammatical analysis of the LOB corpus, a collection of about 1 million words of British English texts available in machine-readable form.

The first phase of the project, completed in September 1983, produced a grammatically annotated version of the corpus in which each word token carries a tag showing its word class. Over 93 per cent of the word tags were correctly selected by using a matrix of tag pair probabilities, and this figure was raised by a further 3 per cent by retagging problematic strings of words prior to disambiguation and by altering the probability weightings for sequences of three tags. The remaining 3 to 4 per cent were corrected by a human post-editor.

The system was originally designed to run in batch mode over the corpus, but we have recently modified the procedures to run interactively on sample sentences typed in by a user at a terminal. We are currently extending the word tag set and improving the word tagging procedures to further reduce manual intervention. A similar probabilistic system is being developed for phrase and clause tagging.
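
The tag-pair step described in the abstract can be illustrated with a small sketch. It is not the project's actual procedure: the tag names, the lexicon, and every probability below are invented for illustration, and the function `disambiguate` is a hypothetical name. The sketch assumes each word has a list of candidate tags and picks the tag sequence whose product of tag-pair probabilities is highest, using a standard dynamic-programming (Viterbi-style) search. It does not cover the retagging of problematic strings or the three-tag weightings mentioned above.

```python
from typing import Dict, List, Tuple

# Hypothetical tag-pair probabilities P(tag | previous tag).
# All values are invented; a real matrix would be estimated from tagged text.
TAG_PAIR_PROB: Dict[Tuple[str, str], float] = {
    ("START", "DET"): 0.6,
    ("START", "NOUN"): 0.3,
    ("START", "VERB"): 0.1,
    ("DET", "NOUN"): 0.9,
    ("DET", "VERB"): 0.1,
    ("NOUN", "VERB"): 0.6,
    ("NOUN", "NOUN"): 0.4,
    ("VERB", "DET"): 0.5,
    ("VERB", "NOUN"): 0.3,
    ("VERB", "VERB"): 0.2,
}

# Hypothetical lexicon: each word maps to its candidate tags.
LEXICON: Dict[str, List[str]] = {
    "the": ["DET"],
    "dog": ["NOUN", "VERB"],
    "walks": ["NOUN", "VERB"],
}

# Assumption: unknown words may take any open-class tag.
OPEN_CLASS_TAGS: List[str] = ["NOUN", "VERB"]

# Small probability for tag pairs missing from the matrix.
DEFAULT_PROB = 1e-6


def disambiguate(words: List[str]) -> List[str]:
    """Return the most probable tag sequence under the tag-pair matrix."""
    # best[tag] holds (score, path) for the best sequence ending in that tag.
    best: Dict[str, Tuple[float, List[str]]] = {"START": (1.0, [])}
    for word in words:
        candidates = LEXICON.get(word.lower(), OPEN_CLASS_TAGS)
        new_best: Dict[str, Tuple[float, List[str]]] = {}
        for tag in candidates:
            # Choose the predecessor that maximises the running product.
            score, path = max(
                (
                    (prev_score * TAG_PAIR_PROB.get((prev_tag, tag), DEFAULT_PROB), prev_path)
                    for prev_tag, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    # Pick the highest-scoring complete path.
    return max(best.values(), key=lambda item: item[0])[1]


if __name__ == "__main__":
    print(disambiguate(["The", "dog", "walks"]))
```

With these toy numbers, "dog" and "walks" are each ambiguous between noun and verb, and the sequence DET NOUN VERB wins because its product of pair probabilities (0.6 × 0.9 × 0.6) exceeds that of every alternative.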