An adaptive k-nearest neighbor text categorization strategy

  • Authors:
  • Li Baoli; Lu Qin; Yu Shiwen

  • Affiliations:
  • Peking University, Beijing, China; The Hong Kong Polytechnic University, Kowloon, Hong Kong; Peking University, Beijing, China

  • Venue:
  • ACM Transactions on Asian Language Information Processing (TALIP)
  • Year:
  • 2004

Abstract

k is the most important parameter in a text categorization system based on the k-nearest neighbor algorithm (kNN). To classify a new document, its k nearest documents in the training set are determined first, and the document's categories are then predicted from the category distribution among these k neighbors. In practice, the class distribution of a training set is rarely even: some classes have many more samples than others. The system's performance is therefore very sensitive to the choice of k, and a fixed k value is likely to bias predictions toward large categories and to underuse the information in the training set. To address these problems, this article proposes an improved kNN strategy that uses a different number of nearest neighbors for each category instead of a fixed number across all categories. More neighbors are consulted when deciding whether a test document belongs to a category with more training samples; that is, the number of neighbors selected for each category adapts to its sample size in the training set. Experiments on two different datasets show that our methods are less sensitive to the parameter k than the traditional ones and can properly classify documents belonging to smaller classes even with a large k. The strategy is especially applicable and promising when estimating k via cross-validation is not possible and the class distribution of the training set is skewed.
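To make the adaptive scheme concrete, the sketch below implements one plausible reading of it in Python: each category c receives its own neighbor count k_c, scaled in proportion to its size in the training set, and the category is scored over its own top-k_c neighbors. The proportional scaling rule, the cosine-similarity measure, and the normalization by k_c are illustrative assumptions, not the authors' published formulation.

```python
import math
from collections import Counter

import numpy as np

def adaptive_knn_predict(X_train, y_train, x_test, k=30):
    """Assign x_test to a category using per-category neighbor counts."""
    # Cosine similarity between the test document and every training document.
    sims = X_train @ x_test / (
        np.linalg.norm(X_train, axis=1) * np.linalg.norm(x_test) + 1e-12
    )
    order = np.argsort(-sims)  # training-set indices, most similar first

    counts = Counter(y_train)      # samples per category
    n_max = max(counts.values())   # size of the largest category

    scores = {}
    for cat, n_c in counts.items():
        # Assumed scaling rule: the largest category uses the full k;
        # smaller categories use proportionally fewer neighbors.
        k_c = max(1, math.ceil(k * n_c / n_max))
        top = order[:k_c]
        in_cat = y_train[top] == cat
        # Score = summed similarity of this category's members among its
        # own top-k_c neighbors, normalized by k_c so that categories
        # consulting more neighbors gain no automatic advantage.
        scores[cat] = sims[top][in_cat].sum() / k_c

    return max(scores, key=scores.get)

# Hypothetical usage on a skewed training set:
rng = np.random.default_rng(0)
X_train = rng.random((200, 50))                      # 200 docs, 50 features
y_train = np.array(["big"] * 150 + ["small"] * 50)   # uneven class sizes
print(adaptive_knn_predict(X_train, y_train, rng.random(50), k=30))
```

Under such a rule, a large k no longer swamps small categories: each category's evidence comes from a neighborhood matched to its own size, which is consistent with the abstract's claim that documents of smaller classes can still be classified properly when k is large.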