A Non-VSM kNN algorithm for text classification

Authors:
Zhi-Hong Deng;Shi-Wei Tang
Affiliations:
National Laboratory on Machine Perception, School of Electronics Engineering and Computer Science, Peking University, Beijing, China;National Laboratory on Machine Perception, School of Electronics Engineering and Computer Science, Peking University, Beijing, China
Venue:
ADMA'05 Proceedings of the First International Conference on Advanced Data Mining and Applications
Year:
2005


Abstract

Text classification, the task of assigning natural language texts to predefined categories based on their content, has been widely studied. Traditional text classification methods represent documents with the VSM (Vector Space Model), which views documents as vectors in a high-dimensional space. In this paper, we propose a non-VSM kNN algorithm for text classification. Based on correlations between categories and features, the algorithm first extracts k F-C tuples from an unlabeled document, namely the k tuples with the highest correlation values. The algorithm then predicts the category of the unlabeled document from these tuples. We have evaluated the algorithm on two document collections and compared it against traditional kNN. Experimental results show that our algorithm outperforms traditional kNN in both efficiency and effectiveness.
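
The two steps described in the abstract (select the k highest-correlation F-C tuples, then predict from them) can be illustrated with a minimal Python sketch. The abstract does not give the correlation measure or the rule for combining tuples, so both are assumptions here: the correlation of a feature with a category is taken to be the fraction of training documents containing that feature that belong to the category, and the predicted category is the one with the largest summed correlation among the selected tuples. The reading of "F-C" as feature-category, and the names `train_correlations` and `classify`, are likewise hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict


def train_correlations(labeled_docs):
    """Build {(feature, category): correlation} from (tokens, category) pairs.

    ASSUMPTION: correlation = share of the training documents containing
    the feature that belong to the category. The paper's actual measure
    is not stated in the abstract.
    """
    doc_counts = defaultdict(Counter)
    for tokens, category in labeled_docs:
        for feature in set(tokens):  # count each feature once per document
            doc_counts[feature][category] += 1

    correlations = {}
    for feature, per_category in doc_counts.items():
        total = sum(per_category.values())
        for category, count in per_category.items():
            correlations[(feature, category)] = count / total
    return correlations


def classify(tokens, correlations, k):
    """Predict a category from the k highest-correlation F-C tuples.

    ASSUMPTION: the selected tuples vote by summing their correlation
    values per category. Returns None if no known feature occurs.
    """
    present = set(tokens)
    candidates = [
        (value, feature, category)
        for (feature, category), value in correlations.items()
        if feature in present
    ]
    # Highest correlation first; ties broken alphabetically for determinism.
    top_k = sorted(candidates, key=lambda t: (-t[0], t[1], t[2]))[:k]

    scores = defaultdict(float)
    for value, _feature, category in top_k:
        scores[category] += value
    if not scores:
        return None
    return max(sorted(scores), key=scores.get)


# Toy training data, invented purely for illustration.
training_docs = [
    (["goal", "match", "team"], "sports"),
    (["team", "score", "match"], "sports"),
    (["stock", "market", "score"], "finance"),
    (["market", "bank", "stock"], "finance"),
]
correlations = train_correlations(training_docs)
prediction = classify(["match", "score", "bank", "team"], correlations, k=3)
print(prediction)  # sports
```

Note that no document vectors or pairwise document distances are computed at prediction time in this sketch: the unlabeled document is only used to look up precomputed feature-category values, which is one plausible reading of why a non-VSM approach could be cheaper than traditional kNN.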