Opinion mining from noisy text data

  • Authors:
  • Lipika Dey;S K Mirajul Haque

  • Affiliations:
  • TCS Innovation Lab Delhi, Udyog Vihar, Gurgaon, India;TCS Innovation Lab Delhi, Udyog Vihar, Gurgaon, India

  • Venue:
  • Proceedings of the second workshop on Analytics for noisy unstructured text data
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

The proliferation of Internet has not only generated huge volumes of unstructured information in the form of web documents, but a large amount of text is also generated in the form of emails, blogs, and feedbacks etc. The data generated from online communication acts as potential gold mines for discovering knowledge. Text analytics has matured and is being successfully employed to mine important information from unstructured text documents. Most of these techniques use Natural Language Processing techniques which assume that the underlying text is clean and correct. Statistical techniques, though not as accurate as linguistic mechanisms, are also employed for the purpose to overcome the dependence on clean text. The chief bottleneck for designing statistical mechanisms is however its dependence on appropriately annotated training data. None of these methodologies are suitable for mining information from online communication text data due to the fact that they are often noisy. These texts are informally written. They suffer from spelling mistakes, grammatical errors, improper punctuation and irrational capitalization. This paper focuses on opinion extraction from noisy text data. It is aimed at extracting and consolidating opinions of customers from blogs and feedbacks, at multiple levels of granularity. Ours is a hybrid approach, in which we initially employ a semi-supervised method to learn domain knowledge from a training repository which contains both noisy and clean text. Thereafter we employ localized linguistic techniques to extract opinion expressions from noisy text. We have developed a system based on this approach, which provides the user with a platform to analyze opinion expressions extracted from a repository.