An Effective Dimension Reduction Approach to Chinese Document Classification Using Genetic Algorithm
ISNN 2009 Proceedings of the 6th International Symposium on Neural Networks: Advances in Neural Networks - Part II
Text classification is one means of understanding text content. It is widely used in information retrieval, spam filtering, rumor monitoring, and blocking pornographic and malicious messages. kNN is widely used in text categorization, but it suffers when the training data set is biased. While developing the Prototype of Internet Information Security for the Shanghai Council of Information and Security, we found that when the training data set is biased, almost all test documents of some rare (smaller) categories are classified into common (larger) ones by the traditional kNN classifier. In this case, classification performance cannot satisfy the user's requirements. To alleviate this problem, we adopt two measures to boost the kNN classifier. First, we optimize the features by removing some candidate features. Second, we modify the traditional decision rule by incorporating the number of training samples of each category. Extensive experiments show that the adapted kNN achieves a significant improvement in classification performance on biased corpora.
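The abstract does not give the exact form of the modified decision rule, so the following is only a minimal sketch of one plausible reading: each category's similarity-weighted vote among the k nearest neighbours is divided by that category's training-set size. The function names (cosine, knn_classify), the size_adjusted flag, the division by class size, and the toy data are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(query, train, k=5, size_adjusted=True):
    """Classify `query` with kNN over `train`, a list of (vector, label) pairs.

    With size_adjusted=False this is the traditional similarity-weighted vote.
    With size_adjusted=True each category's vote is divided by its number of
    training samples (an assumed normalisation, not taken from the paper).
    """
    class_sizes = Counter(label for _, label in train)
    # Rank all training documents by similarity to the query and keep the top k.
    scored = [(cosine(query, vec), label) for vec, label in train]
    neighbours = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Sum similarities per category among the k nearest neighbours.
    votes = Counter()
    for sim, label in neighbours:
        votes[label] += sim
    if size_adjusted:
        votes = {label: v / class_sizes[label] for label, v in votes.items()}
    return max(votes, key=votes.get)


# Hypothetical biased corpus: 6 "news" documents versus 2 "spam" documents.
train = (
    [({"x": 1.0, "y": 1.0}, "news")] * 3
    + [({"y": 1.0}, "news")] * 3
    + [({"x": 2.0, "y": 1.0}, "spam")] * 2
)
query = {"x": 1.0}

traditional = knn_classify(query, train, k=5, size_adjusted=False)
adjusted = knn_classify(query, train, k=5, size_adjusted=True)
```

On this toy corpus the two nearest neighbours are the spam documents, yet the three moderately similar news documents outvote them under the traditional rule. Dividing by category size reverses that outcome, which is the kind of rare-category bias the abstract describes.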