Text clustering with important words using normalization

Authors:
Shunyao Wu; Jinlong Wang; Huy Quan Vu; Gang Li
Affiliations:
Qingdao Technological University, Qingdao, China; Qingdao Technological University, Qingdao, China; Deakin University, Victoria, Australia; Deakin University, Victoria, Australia
Venue:
Proceedings of the 10th annual joint conference on Digital libraries
Year:
2010

Abstract

Important words, which usually appear in parts of the Title, Subject, and Keywords, briefly reflect the main topic of a document. In recent years, it has become common practice to exploit the semantic topics of documents and to use important words for document clustering, especially for short texts such as news articles. This paper proposes a novel method that extracts important words from the Subject and Keywords of articles and then partitions documents using only those words. Because the frequencies of important words are usually low and the matrix dataset built from them is small in scale, a normalization method is then proposed for this dataset so that the limited information can be fully exploited and more accurate results achieved. Experiments validate the effectiveness of the method.
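
The pipeline described in the abstract (build a small matrix from important words, normalize it, then partition the documents) can be illustrated with a minimal sketch. The abstract does not specify the normalization or the clustering algorithm, so the choices here are assumptions made purely for illustration: rows are scaled to unit L2 length, and a plain k-means with a deterministic start stands in for the partitioning step. The function names (build_matrix, l2_normalize, kmeans) and the toy keyword lists are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def build_matrix(docs):
    """Build a document-by-word count matrix from per-document keyword lists."""
    vocab = sorted({w for words in docs for w in words})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for words in docs:
        row = [0.0] * len(vocab)
        for word, count in Counter(words).items():
            row[index[word]] = float(count)
        matrix.append(row)
    return vocab, matrix


def l2_normalize(matrix):
    """Scale each row to unit L2 length (assumed normalization, not the paper's)."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row] if norm > 0 else row[:])
    return normalized


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iters=20):
    """Plain k-means; the first k rows seed the centroids so runs are repeatable."""
    centroids = [row[:] for row in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign every document to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(row, centroids[j]))
            for row in rows
        ]
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [row for row, label in zip(rows, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy "Subject/Keywords" lists for four short news items (made-up data).
docs = [
    ["election", "vote", "senate"],
    ["football", "league", "goal"],
    ["vote", "senate", "policy"],
    ["goal", "league", "match"],
]

vocab, counts = build_matrix(docs)
normalized = l2_normalize(counts)
labels = kmeans(normalized, k=2)
print(labels)
```

With so few words per document, each row of the matrix is sparse and its counts are small, which is the situation the abstract points to as the motivation for normalizing before clustering.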