A clustering technique for news articles using WordNet

Authors:
Christos Bouras;Vassilis Tsogkas
Affiliations:
Computer Technology Institute and Press "Diophantus", Patras, Greece and Computer Engineering and Informatics Department, University of Patras, 26500, Rion, Patras, Greece;Computer Technology Institute and Press "Diophantus", Patras, Greece and Computer Engineering and Informatics Department, University of Patras, 26500, Rion, Patras, Greece
Venue:
Knowledge-Based Systems
Year:
2012

Abstract

The Web is overcrowded with news articles, an information source that is overwhelming in both volume and diversity. Document clustering is a powerful technique that has been widely used for organizing data into smaller, manageable information kernels. Several approaches have been proposed, but they suffer from problems such as synonymy, ambiguity, and the lack of descriptive labels for the generated clusters. In this work, we investigate the application of a wide range of clustering algorithms and similarity measures to news articles that originate from the Web. We also propose enhancing the standard k-means algorithm with external knowledge from WordNet hypernyms in two ways: enriching the "bag of words" used prior to the clustering process, and assisting the label generation procedure that follows it. Furthermore, we examine the effect that text preprocessing has on clustering. On a corpus of news articles drawn from major news portals, our comparison of existing clustering methodologies revealed that k-means gives better aggregate results in terms of efficiency. Despite the algorithm's simple nature, this advantage is amplified when it is preceded by preliminary data cleaning and normalization steps. Moreover, the proposed WordNet-enabled W-k means clustering algorithm significantly improves on standard k-means and also generates useful, high-quality cluster tags through the presented cluster labeling process.
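
The twofold use of hypernyms described in the abstract can be illustrated with a minimal sketch. It is not the authors' W-k means implementation, whose details are not given here. The hypernym table `TOY_HYPERNYMS`, the helper names (`enrich`, `cosine`, `kmeans`, `label`), the cosine similarity, the first-k seeding, and the four toy documents are all assumptions made for illustration; a real system would look hypernyms up in WordNet rather than in a hand-written dictionary.

```python
import math
from collections import Counter

# Hypothetical stand-in for WordNet hypernym lookups (assumption, not real WordNet data).
TOY_HYPERNYMS = {
    "striker": ["athlete"], "goalkeeper": ["athlete"], "football": ["sport"],
    "bank": ["institution"], "stock": ["asset"], "bond": ["asset"],
}

def enrich(tokens, hypernyms=TOY_HYPERNYMS):
    """Return a bag of words extended with the hypernyms of each token."""
    bag = Counter(tokens)
    for tok in tokens:
        for hyper in hypernyms.get(tok, []):
            bag[hyper] += 1
    return bag

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def kmeans(bags, k, iters=10):
    """Plain k-means with cosine similarity; the first k bags seed the centroids."""
    centroids = [Counter(b) for b in bags[:k]]
    assign = [0] * len(bags)
    for _ in range(iters):
        # Assignment step: each document goes to its most similar centroid.
        assign = [max(range(k), key=lambda c: cosine(b, centroids[c])) for b in bags]
        # Update step: each centroid becomes the mean of its member bags.
        for c in range(k):
            members = [b for b, a in zip(bags, assign) if a == c]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[c] = Counter({t: v / len(members) for t, v in total.items()})
    return assign

def label(bags, assign, cluster, hypernyms=TOY_HYPERNYMS, top=2):
    """Label a cluster with its most frequent hypernym terms."""
    hyper_terms = {h for hs in hypernyms.values() for h in hs}
    counts = Counter()
    for b, a in zip(bags, assign):
        if a == cluster:
            counts.update({t: v for t, v in b.items() if t in hyper_terms})
    return [t for t, _ in counts.most_common(top)]

docs = [
    ["striker", "wins", "football", "match"],
    ["bank", "buys", "stock"],
    ["goalkeeper", "injured", "training"],
    ["bond", "prices", "fall"],
]
plain_assign = kmeans([Counter(d) for d in docs], k=2)
bags = [enrich(d) for d in docs]
assign = kmeans(bags, k=2)
print(plain_assign)            # [0, 1, 0, 0]
print(assign)                  # [0, 1, 0, 1]
print(label(bags, assign, 0))  # ['athlete', 'sport']
print(label(bags, assign, 1))  # ['asset', 'institution']
```

In this toy setting the unenriched documents share no terms, so the "bond" article is not grouped with the "bank" article; after enrichment the shared hypernym "asset" links them, and the same hypernyms double as cluster labels. The example only illustrates the idea and says nothing about the size of the improvement reported in the paper.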