Classifying news stories using memory based reasoning
SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
A re-examination of text categorization methods
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Interactive, Domain-Independent Identification and Summarization of Topically Related News Articles
ECDL '01 Proceedings of the 5th European Conference on Research and Advanced Technology for Digital Libraries
Newsjunkie: providing personalized newsfeeds via analysis of information novelty
Proceedings of the 13th international conference on World Wide Web
RCV1: A New Benchmark Collection for Text Categorization Research
The Journal of Machine Learning Research
WWW '05 Proceedings of the 14th international conference on World Wide Web
The anatomy of a news search engine
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
HLT '01 Proceedings of the first international conference on Human language technology research
QCS: a tool for querying, clustering, and summarizing documents
NAACL-Demonstrations '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Demonstrations - Volume 4
Google news personalization: scalable online collaborative filtering
Proceedings of the 16th international conference on World Wide Web
Tracking and summarizing news on a daily basis with Columbia's Newsblaster
HLT '02 Proceedings of the second international conference on Human Language Technology Research
Hi-index | 0.00 |
Newistic is a web mining platform that collects and analyses documents crawled from the Internet. Although it currently processes news articles, it can be easily adapted to any other form of text. Data mining functions performed by the system are categorization, clustering and named entity extraction. The main design concern of the system is scalability, which is achieved by a modular architecture that allows multiple instances of the same component to be run in parallel. This paper presents a novel algorithm for analysing web pages which tries to determine the title and text of a news item directly from the HTML code, discarding noise such as menus, ads, or copyright notices. Another contribution of this paper is the application of the Quality Threshold clustering algorithm for document clustering. Additionally, the algorithm has been optimized to increase its speed.