Hierarchical document categorization with k-NN and concept-based thesauri

  • Authors:
  • Sun Lee Bang;Jae Dong Yang;Hyung Jeong Yang

  • Affiliations:
  • Department of Computer and Statistical Informatics, Chonbuk National University, Jeonju 561-756, South Korea;Division of Electronics and Information Engineering, Chonbuk National University, Jeonju 561-756, South Korea;Division of Electronics and Computer Engineering, Chonnam National University, Gwangju 500-757, South Korea

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2006

Abstract

In this paper, we propose a new algorithm that incorporates the relationships of concept-based thesauri into document categorization with the k-nearest-neighbor classifier (k-NN). k-NN is one of the most popular document categorization methods because it performs relatively well despite its simplicity. However, its precision degrades significantly when ambiguity arises, i.e., when there is more than one candidate category to which a document can be assigned. To remedy this drawback, we employ concept-based thesauri in the categorization. Employing a thesaurus entails structuring the categories into hierarchies, since their structure must conform to that of the thesaurus in order to capture relationships between categories. By referencing the various relationships in the thesaurus that correspond to the structured categories, k-NN can be substantially improved by removing the ambiguity. We first perform document categorization using k-NN and then employ the relationships to reduce the ambiguity. Experimental results show that this method improves the precision of k-NN by up to 13.86% without compromising its recall.
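
The two-stage idea described in the abstract (k-NN first, then thesaurus relationships to resolve ambiguity) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not say which relationships are used or how they are weighted. The function names, the `parent` map standing in for the thesaurus hierarchy, the `margin` test for ambiguity, and the `boost` factor are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_scores(doc, training, k):
    """Stage 1: score each category by summed similarity over the k nearest documents."""
    ranked = sorted(((cosine(doc, vec), cat) for vec, cat in training),
                    key=lambda pair: pair[0], reverse=True)
    scores = defaultdict(float)
    for sim, cat in ranked[:k]:
        scores[cat] += sim
    return dict(scores)

def categorize(scores, parent, margin=0.1, boost=0.5):
    """Stage 2 (hypothetical rule): if the top two candidates are within `margin`,
    let each candidate borrow `boost` times the score of candidates that are its
    parent or child in the category hierarchy, then pick the best adjusted score."""
    if not scores:
        return None
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) < 2 or scores[ranked[0]] - scores[ranked[1]] > margin:
        return ranked[0]  # no ambiguity: keep the plain k-NN decision

    def related(a, b):
        # Assumed relationship: direct parent/child link in the hierarchy.
        return parent.get(a) == b or parent.get(b) == a

    adjusted = {
        c: scores[c] + boost * sum(scores[o] for o in scores if o != c and related(c, o))
        for c in scores
    }
    return max(adjusted, key=adjusted.get)
```

With a hierarchy in which "sql" is a child of "databases", the candidate scores `{"networks": 1.0, "sql": 0.95, "databases": 0.6}` are ambiguous under the default margin. The sketch then selects "sql", because its related category "databases" also received support from the neighbors. A real system would draw the relationships from the thesaurus itself rather than from a hand-written `parent` map.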