Hierarchical document categorization with k-NN and concept-based thesauri

Authors:
Sun Lee Bang;Jae Dong Yang;Hyung Jeong Yang
Affiliations:
Department of Computer and Statistical Informatics, Chonbuk National University, Jeonju, South Korea;Division of Electronics and Information Engineering, Chonbuk National University, Jeonju, South Korea;Division of Electronics and Computer Engineering, Chonnam National University, Gwangju, South Korea
Venue:
Information Processing and Management: an International Journal
Year:
2006

Abstract

In this paper, we propose a new algorithm that incorporates the relationships defined in concept-based thesauri into document categorization with the k-nearest-neighbor classifier (k-NN). k-NN is one of the most popular document categorization methods because it performs relatively well despite its simplicity. However, its precision degrades significantly when ambiguity arises, i.e., when there is more than one candidate category to which a document can be assigned. To remedy this drawback, we employ concept-based thesauri in the categorization. Employing a thesaurus entails structuring the categories into hierarchies, since their structure must conform to that of the thesaurus in order to capture relationships between categories. By referencing the various relationships in the thesaurus that correspond to the structured categories, k-NN can be substantially improved by removing the ambiguity. We first perform document categorization using k-NN and then employ the relationships to reduce the ambiguity. Experimental results show that this method improves the precision of k-NN by up to 13.86% without compromising its recall.
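
The two-stage idea in the abstract (categorize with k-NN, then consult thesaurus relationships when the result is ambiguous) can be illustrated with a minimal Python sketch. This is not the authors' algorithm, which the abstract does not specify. Everything here is an assumption made for illustration: the function names, the similarity-sum voting, the `margin` test used to detect ambiguity, the `boost` weight, and the `related` mapping that stands in for thesaurus relationships between hierarchically structured categories.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_scores(doc, training, k):
    """Stage 1: sum the similarities of the k nearest training docs per category."""
    nearest = sorted(
        ((cosine(doc, vec), cat) for vec, cat in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, cat in nearest:
        scores[cat] += sim
    return dict(scores)

def categorize(doc, training, related, k=3, margin=0.1, boost=0.5):
    """Stage 2: if the top two categories are close, use related categories.

    ASSUMPTIONS (not from the paper): ambiguity means the top two scores
    differ by at most `margin`; each category then gains `boost` times the
    scores of the categories the hypothetical thesaurus relates it to.
    """
    scores = knn_scores(doc, training, k)
    if not scores:
        return None
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) < 2 or scores[ranked[0]] - scores[ranked[1]] > margin:
        return ranked[0]  # unambiguous: keep the plain k-NN decision
    adjusted = {
        cat: score + boost * sum(scores.get(r, 0.0) for r in related.get(cat, ()))
        for cat, score in scores.items()
    }
    return max(adjusted, key=adjusted.get)

# Toy data (hypothetical): term-weight vectors paired with category labels.
training = [
    ({"index": 1.0, "query": 1.0, "packet": 0.5}, "networks"),
    ({"index": 3.0, "query": 1.0}, "databases"),
    ({"query": 1.0}, "sql"),
    ({"goal": 1.0}, "sports"),
]
# Hypothetical thesaurus relationships: "sql" is a narrower concept of "databases".
related = {"databases": ["sql"], "sql": ["databases"]}
doc = {"index": 1.0, "query": 1.0}

plain = categorize(doc, training, related={})          # no thesaurus support
with_thesaurus = categorize(doc, training, related)    # thesaurus-aware
```

On this toy data, plain k-NN narrowly prefers "networks", while the thesaurus-aware variant selects "databases" because its related category "sql" also appears among the nearest neighbors. The sketch only conveys the shape of the approach; the relationship types, weights, and ambiguity criterion used in the paper may differ.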