Using maximal spanning trees and word similarity to generate hierarchical clusters of non-redundant RSS news articles

Authors:
Maria Soledad Pera; Yiu-Kai Dennis Ng
Affiliations:
3361 TMCB, Computer Science Department, Brigham Young University, Provo, USA 84602 (both authors)
Venue:
Journal of Intelligent Information Systems
Year:
2012

Abstract

RSS news articles that partially or completely duplicate one another are common on the Internet, which forces Web users to sort through the articles to identify non-redundant information. This manual filtering process is time-consuming and tedious. In this paper, we present a new filtering and clustering approach, called FICUS, which first identifies and eliminates redundant RSS news articles using a fuzzy set information retrieval approach and then clusters the remaining non-redundant articles according to their degrees of resemblance. FICUS uses a tree hierarchy to organize the clusters. The content of each cluster is captured by representative keywords from its articles, so that searching for and retrieving similar RSS news articles is fast and efficient. FICUS is simple: it uses pre-defined word-correlation factors to determine related (words in) RSS news articles and to filter redundant ones, and it relies on well-known yet simple mathematical models, such as the standard deviation, the vector space model, and probability theory, to generate clusters of non-redundant RSS news articles. Experiments performed on test sets of RSS news articles on various topics, downloaded from different online sources, verify the accuracy of FICUS in eliminating redundant RSS news articles, clustering similar articles together, and separating articles that differ in content. Further empirical studies show that FICUS outperforms well-known approaches adopted for clustering RSS news articles.
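
To make the filter-then-cluster idea concrete, the following Python sketch may help. It is not the FICUS implementation, and several parts are assumptions made only for illustration: plain cosine similarity over bag-of-words vectors stands in for the fuzzy set model and its word-correlation factors, the duplicate threshold of 0.95 is arbitrary, and the rule of cutting maximal-spanning-tree edges weaker than the mean minus one standard deviation is a guess at how the standard deviation might be used. All function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sum(w * w for w in a.values()) ** 0.5
    nb = sum(w * w for w in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def filter_redundant(docs, threshold=0.95):
    """Keep a document only if it is not too similar to one already kept.

    Assumption: cosine similarity replaces the paper's fuzzy set measure.
    Returns the indices of the documents that survive the filter.
    """
    kept = []
    for i, doc in enumerate(docs):
        if all(cosine(doc, docs[k]) < threshold for k in kept):
            kept.append(i)
    return kept


def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def maximal_spanning_tree(n, sim):
    """Kruskal's algorithm, visiting edges from most to least similar."""
    parent = list(range(n))
    tree = []
    edges = sorted(combinations(range(n), 2), key=lambda e: sim[e], reverse=True)
    for i, j in edges:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j, sim[(i, j)]))
    return tree


def clusters_from_tree(n, tree):
    """Cut weak tree edges; the remaining connected components are clusters.

    Assumption: an edge is 'weak' if its weight is below mean - std of
    the tree's edge weights. The paper may use a different rule.
    """
    weights = [w for _, _, w in tree]
    cutoff = mean(weights) - pstdev(weights) if weights else 0.0
    parent = list(range(n))
    for i, j, w in tree:
        if w >= cutoff:
            parent[_find(parent, i)] = _find(parent, j)
    groups = {}
    for v in range(n):
        groups.setdefault(_find(parent, v), []).append(v)
    return sorted(groups.values())


# Toy input: five "articles" as bags of words; the second duplicates the first.
docs = [
    {"election": 1, "vote": 1, "senate": 1},
    {"election": 1, "vote": 1, "senate": 1},
    {"election": 1, "vote": 1, "poll": 1},
    {"soccer": 1, "goal": 1, "league": 1},
    {"soccer": 1, "goal": 1, "cup": 1},
]
kept = filter_redundant(docs)
unique = [docs[i] for i in kept]
sim = {(i, j): cosine(unique[i], unique[j])
       for i, j in combinations(range(len(unique)), 2)}
tree = maximal_spanning_tree(len(unique), sim)
clusters = clusters_from_tree(len(unique), tree)
print(kept, clusters)
```

In this toy run the exact duplicate is dropped first, and the spanning tree over the four remaining articles is then split at its one zero-similarity edge, leaving a politics cluster and a sports cluster. Applying the same cut recursively inside each cluster would be one way to obtain a tree hierarchy, though that step is not shown here.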