Scalable Web Mining with Newistic

Authors:
Ovidiu Dan; Horatiu Mocian
Affiliations:
INHOLLAND University, Diemen, The Netherlands; Imperial College, London, United Kingdom
Venue:
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Year:
2009



Abstract

Newistic is a web mining platform that collects and analyses documents crawled from the Internet. Although it currently processes news articles, it can easily be adapted to any other form of text. The system performs three data mining functions: categorization, clustering, and named entity extraction. Its main design concern is scalability, achieved through a modular architecture that allows multiple instances of the same component to run in parallel. This paper presents a novel algorithm for analysing web pages, which tries to determine the title and text of a news item directly from the HTML code while discarding noise such as menus, ads, or copyright notices. Another contribution of this paper is the application of the Quality Threshold clustering algorithm to document clustering. The clustering algorithm has also been optimized for speed.
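
For readers unfamiliar with Quality Threshold (QT) clustering, the following is a minimal Python sketch of the textbook, unoptimized form of the algorithm applied to documents. It is not the Newistic implementation, and it does not reproduce the speed optimization mentioned in the abstract, which is not described here. The bag-of-words dictionaries, the cosine distance, the function names `cosine_distance` and `qt_cluster`, and the `max_diameter` value are all assumptions chosen for illustration.

```python
from math import sqrt


def cosine_distance(a, b):
    """Cosine distance between two sparse term-weight dicts (assumed representation)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def qt_cluster(docs, max_diameter):
    """Baseline QT clustering: returns a list of clusters of document indices."""
    n = len(docs)
    # Precompute all pairwise distances once; this is quadratic in n.
    dist = [[cosine_distance(docs[i], docs[j]) for j in range(n)] for i in range(n)]

    remaining = list(range(n))
    clusters = []
    while remaining:
        best = []
        # Grow one candidate cluster from every remaining document.
        for seed in remaining:
            candidate = [seed]
            others = [i for i in remaining if i != seed]
            while others:
                # Distance from a point to its farthest member of the candidate;
                # adding the point that minimizes this keeps the diameter smallest.
                def farthest(i):
                    return max(dist[i][j] for j in candidate)

                nxt = min(others, key=farthest)
                if farthest(nxt) > max_diameter:
                    break  # any further addition would exceed the threshold
                candidate.append(nxt)
                others.remove(nxt)
            if len(candidate) > len(best):
                best = candidate
        # Keep the largest candidate, remove its members, and repeat.
        clusters.append(best)
        remaining = [i for i in remaining if i not in best]
    return clusters


# Toy input: five tiny "articles" as term-count dicts (made-up data).
docs = [
    {"election": 2, "vote": 1},
    {"election": 1, "vote": 2},
    {"football": 3, "goal": 1},
    {"football": 1, "goal": 2},
    {"weather": 1},
]
clusters = qt_cluster(docs, max_diameter=0.5)
```

Unlike k-means, QT clustering does not need the number of clusters in advance; the caller fixes only the maximum cluster diameter, which is convenient when the number of news stories on a given day is unknown. The cost is that this baseline form rebuilds a candidate cluster for every remaining document in every round, which is why an optimized variant matters at scale.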