This paper compares k-means and fuzzy c-means for clustering a large, noisy, realistic dataset. We made the comparison using a free cloud computing solution, Apache Mahout/Hadoop, and Wikipedia's latest articles. In the past, these two algorithms were applied mostly to small datasets, so studies relied on artificial datasets that do not represent a real document clustering situation. In this ongoing research we found that, on a noisy dataset, fuzzy c-means can lead to worse cluster quality than k-means. We also found that k-means does not always converge faster. Finally, we found that Mahout is a promising clustering technology, but its preprocessing tools are not developed enough for efficient dimensionality reduction. In our experience, the use of Apache Mahout is premature.
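To make the difference between the two algorithms concrete, the following is a minimal sketch of their assignment steps in plain Python. It is not Mahout's API and not the setup used in the paper: the function names, the toy points, and the fuzzifier value m = 2 are assumptions chosen for illustration. K-means gives each point a single hard label, while fuzzy c-means gives each point a membership degree in every cluster, using the standard membership formula.

```python
import math

def dist(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans_assign(points, centers):
    """Hard assignment: each point gets the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points]

def fcm_memberships(points, centers, m=2.0):
    """Soft assignment: each point gets one membership degree per center.

    Standard fuzzy c-means formula, with fuzzifier m > 1:
    u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1))
    """
    exponent = 2.0 / (m - 1.0)
    n_centers = len(centers)
    rows = []
    for p in points:
        d = [dist(p, c) for c in centers]
        if 0.0 in d:
            # The point coincides with a center: give that center full membership.
            hit = d.index(0.0)
            rows.append([1.0 if j == hit else 0.0 for j in range(n_centers)])
            continue
        rows.append([
            1.0 / sum((d[j] / d[k]) ** exponent for k in range(n_centers))
            for j in range(n_centers)
        ])
    return rows

# Hypothetical toy data: three 2-D points and two cluster centers.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]

labels = kmeans_assign(points, centers)         # one label per point
memberships = fcm_memberships(points, centers)  # one row of degrees per point
```

In this sketch every row of `memberships` sums to 1, so a point that lies between clusters spreads its weight across them rather than committing to one. With noisy documents that spreading may blur cluster boundaries, which is one plausible reading of the quality gap reported above, though the sketch itself does not demonstrate it.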