Impact of imperfect OCR on part-of-speech tagging

  • Authors: Xiaofan Lin

  • Affiliations: -

  • Venue: ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
  • Year: 2003

Abstract

Part-of-speech (POS) tagging is the foundation of natural language processing (NLP) systems, and thus has been an active area of research for many years. However, one question remains unanswered: How will a POS tagger behave when the input text is not error-free? This issue can be of great importance when the text comes from imperfect sources like Optical Character Recognition (OCR). This paper analyzes the performance of both individual POS taggers and combination systems on imperfect text. Experimental results show that a POS tagger's accuracy will decrease linearly with the character error rate and the slope indicates a tagger's sensitivity to input text errors.
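
The linear relationship described in the abstract can be illustrated with a small least-squares fit of tagging accuracy against character error rate, where the fitted slope plays the role of the sensitivity measure. This is only a sketch under stated assumptions: the function name `fit_line` and the data points are hypothetical placeholders invented for illustration, not measurements or methods taken from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical, made-up values for illustration only (NOT results from the paper):
# character error rate of the OCR output and the tagger accuracy measured on it.
char_error_rates = [0.00, 0.02, 0.04, 0.06, 0.08]
tagger_accuracy = [0.96, 0.92, 0.88, 0.84, 0.80]

slope, intercept = fit_line(char_error_rates, tagger_accuracy)
print(f"slope (sensitivity to input errors): {slope:.3f}")
print(f"intercept (accuracy on clean text):  {intercept:.3f}")
```

Under this reading, a steeper negative slope would mean the tagger loses accuracy faster as the character error rate grows, while the intercept estimates accuracy on error-free text. Comparing the slopes fitted for different taggers on the same degraded text is one way to compare their sensitivity.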