Text clustering with important words using normalization

  • Authors:
  • Shunyao Wu;Jinlong Wang;Huy Quan Vu;Gang Li

  • Affiliations:
  • Qingdao Technological University, Qingdao, China;Qingdao Technological University, Qingdao, China;Deakin University, Victoria, Australia;Deakin University, Victoria, Australia

  • Venue:
  • Proceedings of the 10th annual joint conference on Digital libraries
  • Year:
  • 2010

Abstract

Important words, which usually appear in parts of the Title, Subject, and Keywords fields, briefly reflect the main topic of a document. In recent years it has become common practice to exploit the semantic topics of documents and to use important words for document clustering, especially for short texts such as news articles. This paper proposes a novel method that extracts important words from the Subject and Keywords of articles and then partitions documents using only those words. Because the frequencies of important words are usually low and the resulting matrix dataset for important words is small in scale, a normalization method is proposed for this dataset so that the limited information can be fully exploited and more accurate results achieved. Experiments validate the effectiveness of the method.
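The pipeline outlined in the abstract (important words, a small document-word matrix, normalization, clustering) can be illustrated with a minimal sketch. The abstract does not specify the normalization or the clustering algorithm, so this sketch assumes plain L2 row normalization and a cosine-based k-means as stand-ins. All function names and the toy keyword lists are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def build_matrix(docs):
    """Build a document-word count matrix from lists of important words."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    rows = []
    for d in docs:
        row = [0.0] * len(vocab)
        for word, count in Counter(d).items():
            row[index[word]] = float(count)
        rows.append(row)
    return vocab, rows

def l2_normalize(row):
    """Scale a row to unit length (assumed stand-in for the paper's normalization)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else row[:]

def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))

def init_centers(rows, k):
    """Deterministic farthest-first initialization starting from the first row."""
    centers = [rows[0][:]]
    while len(centers) < k:
        nxt = min(rows, key=lambda r: max(cosine(r, c) for c in centers))
        centers.append(nxt[:])
    return centers

def cluster(rows, k, max_iters=20):
    """Cosine-based k-means over unit-length rows; returns one label per row."""
    centers = init_centers(rows, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [max(range(k), key=lambda j: cosine(r, centers[j])) for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[j] = l2_normalize(mean)
    return labels

# Hypothetical Subject/Keywords terms for four short news articles.
docs = [
    ["election", "vote", "senate"],
    ["vote", "election", "campaign"],
    ["football", "league", "goal"],
    ["goal", "league", "transfer"],
]
vocab, counts = build_matrix(docs)
rows = [l2_normalize(r) for r in counts]
labels = cluster(rows, k=2)
print(labels)
```

Normalizing each row to unit length keeps documents with very few important words comparable to those with more, which is one plausible reading of why normalization helps when word frequencies are low.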