Web Document Clustering by Using Automatic Keyphrase Extraction

Authors:
Juhyun Han;Taehwan Kim;Joongmin Choi
Venue:
WI-IATW '07 Proceedings of the 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops
Year:
2007

Citing 6

KEA: practical automatic keyphrase extraction
Proceedings of the fourth ACM conference on Digital libraries

Cluster merging and splitting in hierarchical clustering algorithms
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining

Automating Keyphrase Extraction with Multi-Objective Genetic Algorithms
HICSS '04 Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS'04) - Track 4 - Volume 4

TopCat: Data Mining for Topic Identification in a Text Corpus
IEEE Transactions on Knowledge and Data Engineering

Domain-specific keyphrase extraction
IJCAI'99 Proceedings of the 16th international joint conference on Artificial intelligence - Volume 2

Coherent keyphrase extraction via web mining
IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence

Cited 1

A web content mining approach for tag cloud generation
Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services

Abstract

In most traditional document clustering techniques, the number of clusters is not known in advance, and the cluster that contains the target information cannot be determined because no semantic description is associated with each cluster. The well-known K-means clustering algorithm partially solves these problems by allowing users to specify the number of clusters. However, when the pre-specified number of clusters changes, the precision of each result changes as well. To solve this problem, this paper proposes a new clustering algorithm based on Kea, a keyphrase extraction algorithm that uses machine learning techniques to return several keyphrases from the source documents. As in K-means, documents are grouped into several clusters, but the number of clusters is determined automatically by heuristics that use the extracted keyphrases. Our Kea-means clustering algorithm provides an easy and efficient way to extract test documents from massive quantities of resources.
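
The abstract does not spell out the heuristics, so the following Python sketch is only an illustration of the general idea, not the authors' Kea-means algorithm. It assumes each document has already been reduced to a set of keyphrases (for example by a Kea-style extractor, which is not implemented here). The function names, the `min_support` threshold, and the rule "a keyphrase shared by at least `min_support` documents seeds one cluster" are all hypothetical choices made for this example. The point it shows is that the number of clusters falls out of the extracted keyphrases instead of being supplied by the user, and that each cluster carries a keyphrase as a readable label.

```python
from collections import Counter, defaultdict


def choose_seeds(doc_keyphrases, min_support=2):
    """Pick cluster seeds from keyphrases (hypothetical heuristic).

    A keyphrase becomes a seed when it occurs in at least `min_support`
    documents. The number of seeds is the number of clusters, so K is
    derived from the data rather than given by the user.
    """
    doc_freq = Counter()
    for phrases in doc_keyphrases:
        doc_freq.update(set(phrases))  # count each phrase once per document
    seeds = [p for p, n in doc_freq.items() if n >= min_support]
    # Most widely shared keyphrases first; ties broken alphabetically.
    seeds.sort(key=lambda p: (-doc_freq[p], p))
    return seeds


def cluster_by_keyphrase(doc_keyphrases, min_support=2):
    """Assign each document to the first (most frequent) seed it contains.

    Documents that contain no seed keyphrase are collected under None.
    Returns a dict mapping seed keyphrase -> list of document indices.
    """
    seeds = choose_seeds(doc_keyphrases, min_support)
    clusters = defaultdict(list)
    for doc_id, phrases in enumerate(doc_keyphrases):
        label = next((s for s in seeds if s in phrases), None)
        clusters[label].append(doc_id)
    return dict(clusters)


# Toy input: keyphrase sets assumed to come from an extractor such as Kea.
docs = [
    {"k-means", "document clustering"},
    {"document clustering", "web mining"},
    {"keyphrase extraction", "machine learning"},
    {"keyphrase extraction", "naive bayes"},
    {"tag cloud"},
]
clusters = cluster_by_keyphrase(docs)
for label, members in clusters.items():
    print(label, members)
```

On this toy input, two keyphrases are shared by two documents each, so two labelled clusters are formed and the remaining document is left unassigned. A fuller treatment would likely refine the assignment iteratively, as K-means does, using the seed clusters as initial centroids.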