Using Mahout for Clustering Wikipedia's Latest Articles: A Comparison between K-means and Fuzzy C-means in the Cloud

Authors:
Rui Máximo Esteves;Chunming Rong
Affiliations:
-;-
Venue:
CLOUDCOM '11 Proceedings of the 2011 IEEE Third International Conference on Cloud Computing Technology and Science
Year:
2011

Citing 0
Cited 2

An online recommendation system for e-commerce based on apache mahout framework

Proceedings of the 2013 annual conference on Computers and people research
Extracting Knowledge from Wikipedia Articles through Distributed Semantic Analysis

Proceedings of the 13th International Conference on Knowledge Management and Knowledge Technologies

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper compares k-means and fuzzy c-means for clustering a noisy realistic and big dataset. We made the comparison using a free cloud computing solution Apache Mahout/ Hadoop and Wikipedia's latest articles. In the past the usage of these two algorithms was restricted to small datasets. As so, studies were based on artificial datasets that do not represent a real document clustering situation. With this ongoing research we found that in a noisy dataset, fuzzy c-means can lead to worse cluster quality than k-means. The convergence speed of k-means is not always faster. We found as well that Mahout is a promise clustering technology but the preprocessing tools are not developed enough for an efficient dimensionality reduction. From our experience the use of the Apache Mahout is premature.