Automatic Category Structure Generation and Categorization of Chinese Text Documents

  • Authors:
  • Hsin-Chang Yang; Chung-Hong Lee

  • Venue:
  • PKDD '00 Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery
  • Year:
  • 2000

Abstract

Recently, knowledge discovery and data mining in unstructured or semi-structured texts (text mining) has attracted considerable attention from both commercial and research fields. One aspect of text mining is automatic text categorization, which assigns a text document to a predefined category according to the correlation between the document and the category. Traditionally, categories are arranged hierarchically to support effective searching and indexing and to be easy for humans to comprehend. The categories and their hierarchical structure have mostly been determined by human experts. In this work, we developed an approach that automatically generates categories and reveals the hierarchical structure among them. We also used the generated structure to categorize text documents. A self-organizing map is trained on the document collection to form two feature maps. We then analyzed the two maps to obtain the categories and the structure among them. Although the corpus contains documents written in Chinese, the proposed approach can be applied to documents written in any language, provided that such documents can be transformed into a list of separated terms.
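
To illustrate the general idea, here is a minimal sketch of training a self-organizing map on documents given as lists of separated terms. It is not the authors' implementation. The abstract does not say what the two feature maps are, so reading them as one map of documents and one map of terms is an assumption. The function names, grid size, learning rate, and toy terms are also hypothetical.

```python
import random


def vectorize(docs):
    """Turn term lists into binary vectors over a sorted vocabulary."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for term in doc:
            vec[index[term]] = 1.0
        vectors.append(vec)
    return vocab, vectors


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    def sqdist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda j: sqdist(weights[j]))


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map; neurons are stored in row-major order."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # linear decay schedule (assumed)
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)    # at 0.5 only the winner is updated
        for x in vectors:
            b_row, b_col = divmod(best_matching_unit(weights, x), cols)
            for j, w in enumerate(weights):
                r, c = divmod(j, cols)
                if (r - b_row) ** 2 + (c - b_col) ** 2 <= radius ** 2:
                    for k in range(dim):
                        w[k] += lr * (x[k] - w[k])
    return weights


def document_map(weights, vectors):
    """Assumed first map: each document is labeled with its winning neuron."""
    return [best_matching_unit(weights, x) for x in vectors]


def word_map(weights, vocab):
    """Assumed second map: each term goes to the neuron weighting it most."""
    return {
        term: max(range(len(weights)), key=lambda j: weights[j][k])
        for k, term in enumerate(vocab)
    }


if __name__ == "__main__":
    # Toy corpus of already-segmented documents (hypothetical terms).
    docs = [["stock", "market"], ["stock", "market"], ["team", "match"]]
    vocab, vectors = vectorize(docs)
    weights = train_som(vectors, rows=2, cols=2)
    print(document_map(weights, vectors))
    print(word_map(weights, vocab))
```

Neurons that attract many documents or terms could then be inspected as candidate categories. How the two maps are analyzed to obtain the hierarchy is not described in the abstract and is not sketched here.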