Computationally effective algorithm for information extraction and online review mining

  • Authors:
  • Boris Kraychev;Ivan Koychev

  • Affiliations:
  • Sofia University St. Kliment Ohridski, Sofia, Bulgaria;Sofia University St. Kliment Ohridski, Sofia, Bulgaria

  • Venue:
  • Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

The World Wide Web provides continuous sources of information with similar semantic structure like news feeds, user reviews and user comments on various topics. These sources are essential for the goal of online opinion mining. The paper proposes a computationally efficient algorithm for structured information extraction from web pages. The algorithm relies on a combination of analysis of structured data and natural language processing of text content. It maps HTML pages containing news, reviews or user comments to a custom designed RSS feed like structure. Such information usually includes the textual opinions, and factual information like publication date, product price, author name and influence. Due to the real time nature of the data sources the computational complexity of such a solution should be linear or close to linear. The computational complexity of the proposed algorithm is linear. In comparison similar previously published approaches have complexity no smaller than O(n2). Further we conduct experiments with real world data that achieves extraction accuracy of 84% to 92% which is comparable to the recent results in this field. Finally the paper discuses the results of the experiment and shares gained experience that can be useful for applying the algorithm in other domains.