The parameter k is the most important parameter in a text categorization system based on the k-nearest neighbor algorithm (kNN). To classify a new document, the k nearest documents in the training set are determined first; the categories for the document are then predicted from the category distribution among these k nearest neighbors. In general, the class distribution in a training set is uneven: some classes have more samples than others. The system's performance is very sensitive to the choice of k, and a fixed k value is likely to bias predictions toward large categories and to make incomplete use of the information in the training set. To deal with these problems, this article proposes an improved kNN strategy in which different numbers of nearest neighbors are used for different categories, instead of a fixed number across all categories. More samples (nearest neighbors) are used to decide whether a test document should be classified in a category that has more samples in the training set; the number of nearest neighbors selected for each category adapts to that category's sample size in the training set. Experiments on two different datasets show that our methods are less sensitive to the parameter k than the traditional ones and can properly classify documents belonging to smaller classes even when k is large. The strategy is especially applicable and promising for cases where estimating the parameter k via cross-validation is not possible and the class distribution of a training set is skewed.
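The idea of a per-category neighbor count can be illustrated with a short Python sketch. The abstract does not state the rule that maps a category's training-set size to its number of neighbors, nor the scoring function, so both are assumptions here: each category c gets k_c = ceil(k * n_c / n_max), where n_c is its sample count and n_max is the size of the largest category, and a category's score is the share of cosine similarity held by its own documents among the top k_c neighbors. The names `cosine`, `adaptive_k`, and `classify` are hypothetical and do not come from the article.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def adaptive_k(k, class_sizes, min_k=1):
    """Assumed rule (not from the article): scale k by each class's size
    relative to the largest class, rounding up, with a floor of min_k."""
    largest = max(class_sizes.values())
    # -(-x // y) is integer ceiling division, which avoids float rounding.
    return {c: max(min_k, -(-k * n // largest)) for c, n in class_sizes.items()}


def classify(doc, training, k):
    """training is a list of (vector, label) pairs.

    For each category c, look only at the top k_c neighbors and score c by
    the fraction of their total similarity contributed by documents of c.
    Returns (best_label, scores).
    """
    sizes = Counter(label for _, label in training)
    k_per_class = adaptive_k(k, sizes)
    # Rank all training documents by similarity to the test document.
    ranked = sorted(
        ((cosine(doc, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = {}
    for c, kc in k_per_class.items():
        top = ranked[:kc]
        total = sum(sim for sim, _ in top)
        hits = sum(sim for sim, label in top if label == c)
        scores[c] = hits / total if total else 0.0
    return max(scores, key=scores.get), scores


# Toy skewed training set: six "sports" documents, two "chess" documents.
training = [({"ball": 1.0, "game": 1.0}, "sports")] * 6 + [
    ({"chess": 1.0, "game": 1.0}, "chess")
] * 2
label, scores = classify({"chess": 1.0, "game": 1.0}, training, k=6)
print(label, scores)
```

In this toy set, a plain similarity-weighted vote over a fixed k = 6 would leave the two categories roughly tied, because four weaker "sports" neighbors offset the two strong "chess" neighbors. Judging the small category on only its k_c = 2 nearest neighbors removes that dilution, which is the effect the abstract describes for skewed class distributions.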