Impact of imperfect OCR on part-of-speech tagging

Authors:
Xiaofan Lin
Affiliations:
-
Venue:
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
Year:
2003


Abstract

Part-of-speech (POS) tagging is the foundation of natural language processing (NLP) systems, and thus has been an active area of research for many years. However, one question remains unanswered: How will a POS tagger behave when the input text is not error-free? This issue can be of great importance when the text comes from imperfect sources like Optical Character Recognition (OCR). This paper analyzes the performance of both individual POS taggers and combination systems on imperfect text. Experimental results show that a POS tagger's accuracy will decrease linearly with the character error rate and the slope indicates a tagger's sensitivity to input text errors.
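
The linear relationship described in the abstract can be illustrated with a short least-squares fit of tagging accuracy against character error rate. This is a minimal sketch, not the paper's procedure: the helper name fit_line and every number below are assumptions chosen for illustration, and the data points are synthetic placeholders rather than results from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x; zero means all x values are identical.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic placeholder values, NOT measurements from the paper.
# Character error rates of the input text (fraction of characters wrong).
cer = [0.00, 0.01, 0.02, 0.05, 0.10]
# Hypothetical tagger accuracies generated from an exact line for illustration.
accuracy = [0.96 - 1.5 * c for c in cer]

slope, intercept = fit_line(cer, accuracy)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Under this reading, the intercept estimates a tagger's accuracy on error-free text, and the magnitude of the negative slope estimates how much accuracy is lost per unit of character error rate. Comparing slopes across taggers would then compare their sensitivity to input errors.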