Scalable Web Mining with Newistic

  • Authors: Ovidiu Dan; Horatiu Mocian
  • Affiliations: INHOLLAND University, Diemen, The Netherlands; Imperial College, London, United Kingdom
  • Venue: PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
  • Year: 2009

Abstract

Newistic is a web mining platform that collects and analyses documents crawled from the Internet. Although it currently processes news articles, it can easily be adapted to any other form of text. The system performs three data mining functions: categorization, clustering, and named entity extraction. Its main design concern is scalability, which is achieved through a modular architecture that allows multiple instances of the same component to run in parallel. This paper presents a novel algorithm for analysing web pages that tries to determine the title and text of a news item directly from the HTML code, discarding noise such as menus, ads, and copyright notices. Another contribution of this paper is the application of the Quality Threshold clustering algorithm to document clustering; the algorithm has additionally been optimized to increase its speed.
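
For readers unfamiliar with Quality Threshold (QT) clustering, the following is a minimal sketch of the generic algorithm, not the optimized variant the abstract refers to (its details are not given here). It assumes documents are already represented as equal-length term-weight vectors and uses cosine distance; the names `cosine_distance`, `qt_cluster`, and `max_diameter` are illustrative and do not come from the paper. Each round grows one candidate cluster per remaining document, always adding the document that keeps the cluster diameter smallest, then keeps the largest candidate and repeats on what is left.

```python
from math import sqrt


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def qt_cluster(vectors, max_diameter, distance=cosine_distance):
    """Generic QT clustering; returns clusters as sorted lists of indices."""
    n = len(vectors)
    # Precompute all pairwise distances once.
    dist = [[distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    remaining = set(range(n))
    clusters = []
    while remaining:
        best = []
        for seed in sorted(remaining):
            candidate = [seed]
            # reach[p]: largest distance from p to any current member,
            # i.e. the diameter the cluster would have if p were added.
            reach = {p: dist[seed][p] for p in remaining if p != seed}
            while reach:
                nearest = min(reach, key=lambda p: (reach[p], p))
                if reach[nearest] > max_diameter:
                    break  # adding any further point exceeds the threshold
                candidate.append(nearest)
                del reach[nearest]
                for p in reach:
                    reach[p] = max(reach[p], dist[nearest][p])
            if len(candidate) > len(best):
                best = candidate
        clusters.append(sorted(best))
        remaining -= set(best)
    return clusters
```

Unlike k-means, this approach does not need the number of clusters in advance, only a diameter threshold, which suits a news stream where the number of stories is unknown. The sketch recomputes every candidate cluster in each round, so its cost grows quickly with the number of documents.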