Adapting a WSJ trained part-of-speech tagger to noisy text: preliminary results

  • Authors:
  • Phani Gadde; L. V. Subramaniam; Tanveer A. Faruquie

  • Affiliations:
  • Language Technologies Research Centre, IIIT-Hyderabad, India; IBM Research, New Delhi, India; IBM Research, New Delhi, India

  • Venue:
  • Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
  • Year:
  • 2011


Abstract

With the increase in the number of people communicating over the internet, the amount of text available online has grown steadily. Much of this text differs from the standard language, as people use various kinds of short forms for words to save time and effort; we call such text noisy text. Part-of-Speech (POS) tagging has reached high levels of accuracy, enabling the use of automatic POS tags in various language processing tasks; however, tagging performance degrades rapidly on noisy text. This paper is an attempt to adapt a state-of-the-art English POS tagger, trained on the Wall Street Journal (WSJ) corpus, to noisy text. We classify the noise in text into different types and evaluate the tagger with respect to each type. We attack the problem of tagging noisy text in two ways: a) overcoming the noise in a post-processing step after tagging, and b) cleaning the noise and then tagging. We propose techniques for both approaches and critically compare them based on error analysis. We demonstrate the proposed models on a Short Message Service (SMS) dataset, where they achieve a significant improvement over the baseline accuracy of a state-of-the-art English POS tagger on noisy words.
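The two strategies in the abstract can be contrasted with a minimal sketch. The example below uses NLTK's off-the-shelf English tagger as a stand-in for the paper's WSJ-trained tagger, and a toy dictionary of SMS short forms; both the dictionary and the helper functions are hypothetical illustrations under those assumptions, not the techniques proposed in the paper.

```python
# Sketch of the two strategies: (a) tag the noisy text first, then repair tags of
# known noisy words as a post-processing step; (b) clean the noise first, then tag.
# NLTK's averaged-perceptron tagger stands in for the paper's WSJ-trained tagger;
# the normalization lookup is an illustrative toy, not the paper's model.
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)

# Toy lookup of SMS short forms -> standard spellings (illustrative only).
NORMALIZATION = {"u": "you", "r": "are", "gr8": "great", "pls": "please"}


def tag_then_fix(tokens):
    """Strategy (a): tag the noisy tokens, then post-process tags of known noisy words."""
    fixed = []
    for word, tag in nltk.pos_tag(tokens):
        if word.lower() in NORMALIZATION:
            # Re-tag using the tag the clean form receives in isolation.
            tag = nltk.pos_tag([NORMALIZATION[word.lower()]])[0][1]
        fixed.append((word, tag))
    return fixed


def clean_then_tag(tokens):
    """Strategy (b): normalize the noise first, then run the tagger on the clean text."""
    cleaned = [NORMALIZATION.get(w.lower(), w) for w in tokens]
    tags = [tag for _, tag in nltk.pos_tag(cleaned)]
    return list(zip(tokens, tags))


if __name__ == "__main__":
    sms = "u r gr8 pls call me".split()
    print(tag_then_fix(sms))    # tags assigned to the noisy surface forms
    print(clean_then_tag(sms))  # tags computed on the normalized text
```

The trade-off the paper's error analysis weighs is visible even in this toy: strategy (a) keeps the noisy tokens but risks tagging them with poor context, while strategy (b) depends on the quality of the cleaning step before the tagger ever runs.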