Hierarchical document categorization with k-NN and concept-based thesauri

Authors:
Sun Lee Bang; Jae Dong Yang; Hyung Jeong Yang
Affiliations:
Department of Computer and Statistical Informatics, Chonbuk National University, Jeonju 561-756, South Korea; Division of Electronics and Information Engineering, Chonbuk National University, Jeonju 561-756, South Korea; Division of Electronics and Computer Engineering, Chonnam National University, Gwangju 500-757, South Korea
Venue:
Information Processing and Management: an International Journal
Year:
2006

Citing 8
Cited 6

An Evaluation of Statistical Approaches to Text Categorization

Information Retrieval
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Text Categorization with Support Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
A Simple KNN Algorithm for Text Categorization

ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies

The VLDB Journal — The International Journal on Very Large Data Bases
Automatic Textual Document Categorization Based on Generalized Instance Sets and a Metamodel

IEEE Transactions on Pattern Analysis and Machine Intelligence
Text Document Categorization by Term Association

ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Enhancing Text Classification Using Synopses Extraction

WISE '03 Proceedings of the Fourth International Conference on Web Information Systems Engineering

Abordagem não supervisionada para extração de conceitos a partir de textos

Companion Proceedings of the XIV Brazilian Symposium on Multimedia and the Web
A coarse-to-fine framework to efficiently thwart plagiarism

Pattern Recognition
An approach to expert recommendation based on fuzzy linguistic method and fuzzy text classification in knowledge management systems

Expert Systems with Applications: An International Journal
Text categorization algorithms using semantic approaches, corpus-based thesaurus and WordNet

Expert Systems with Applications: An International Journal
Spam filtering using semantic similarity approach and adaptive BPNN

Neurocomputing
Conceptual modeling of cardinality constraints in social publishing

International Journal of Intelligent Systems

Abstract

In this paper, we propose a new algorithm that incorporates the relationships of concept-based thesauri into document categorization with the k-nearest-neighbor (k-NN) classifier. k-NN is one of the most popular document categorization methods because it performs relatively well despite its simplicity. However, its precision degrades significantly when ambiguity arises, i.e., when there is more than one candidate category to which a document can be assigned. To remedy this drawback, we employ concept-based thesauri in the categorization. Employing a thesaurus entails structuring the categories into hierarchies, since their structure must conform to that of the thesaurus in order to capture the relationships between categories. By referencing the various relationships in the thesaurus that correspond to the structured categories, k-NN can be markedly improved because the ambiguity is resolved. Our method first categorizes documents with k-NN and then uses these relationships to reduce the ambiguity. Experimental results show that this method improves the precision of k-NN by up to 13.86% without compromising its recall.
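
The two-stage idea described in the abstract (categorize with k-NN first, then consult thesaurus relationships when several categories remain plausible) can be illustrated with a minimal Python sketch. The abstract does not give the scoring or disambiguation rules, so everything specific here is an assumption made for illustration and is not the authors' algorithm: the cosine similarity over term-weight dictionaries, the `margin` threshold that decides when two categories count as ambiguous, and the `thesaurus` mapping from each category to a set of related terms are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_scores(doc, training, k):
    """Sum the similarities of the k nearest training documents per category."""
    neighbours = sorted(
        ((cosine(doc, vec), cat) for vec, cat in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, cat in neighbours:
        scores[cat] += sim
    return dict(scores)


def categorize(doc, training, k, thesaurus, margin=0.1):
    """Stage 1: k-NN scoring. Stage 2: thesaurus-based tie-breaking.

    `thesaurus` is a hypothetical mapping {category: set of related terms};
    `margin` is an assumed relative threshold for calling a result ambiguous.
    """
    scores = knn_scores(doc, training, k)
    if not scores:
        return None
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0]
    # Categories whose score is within `margin` of the best one are candidates.
    candidates = [
        c for c in ranked if scores[best] - scores[c] <= margin * scores[best]
    ]
    if len(candidates) == 1:
        return best  # no ambiguity, plain k-NN decides

    def support(cat):
        # Weight of document terms that the thesaurus relates to this category.
        related = thesaurus.get(cat, set())
        return sum(w for t, w in doc.items() if t in related)

    # Prefer thesaurus support, fall back to the k-NN score.
    return max(candidates, key=lambda c: (support(c), scores[c]))


# Toy data (invented for this sketch): "bank" is ambiguous between categories.
training = [
    ({"bank": 1.0, "loan": 1.0}, "finance"),
    ({"bank": 1.0, "river": 1.0}, "geography"),
]
thesaurus = {
    "finance": {"loan", "money"},
    "geography": {"river", "water", "lake"},
}
ambiguous_doc = {"bank": 1.0, "water": 1.0}
print(categorize(ambiguous_doc, training, k=2, thesaurus=thesaurus))
```

In this toy setup both categories receive the same k-NN score for `ambiguous_doc`, so the thesaurus term "water" is what tips the decision. A fuller treatment would use the hierarchical relationships between categories that the abstract mentions, which this flat term-set stand-in does not model.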