Feature selection is an important step in text preprocessing: a well-chosen feature subset can match the performance of the full feature set while reducing learning time. To make our system suitable for deployment, and to embed this model in a gateway for real-time text filtering, we need to select more discriminative features. In this paper, we propose a new feature selection method based on rough set theory. It generates several reducts, with the distinguishing property that the reducts share no common attributes; as a result, the selected attributes have a stronger capability to classify new objects, especially on real-world data sets. We evaluate our feature selection method on two data sets: a benchmark data set from the UCI machine learning archive and a data set captured from the Web. We classify the objects with statistical classification methods. On the benchmark test set we obtain good precision with a single reduct, while on the real data set we obtain good precision only with several reducts; the latter data set is used in our system for topic-specific text filtering. We therefore conclude that our method is effective in practice. In addition, we observe that SVM and VSM achieve better performance, while the Naïve Bayes method performs poorly, given the same selected features on an imbalanced data set.
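The idea of generating several reducts with no attributes in common can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it uses a standard greedy, dependency-based reduct heuristic from rough set theory, and repeatedly removes each found reduct's attributes from the candidate pool so that successive reducts are pairwise disjoint. All function and parameter names are illustrative.

```python
def partition(rows, attrs):
    """Group row indices by their values on the given attributes
    (the indiscernibility classes)."""
    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(tuple(row[a] for a in attrs), []).append(i)
    return list(groups.values())

def dependency(rows, labels, attrs):
    """Fraction of rows whose indiscernibility class is pure in the
    decision label (size of the rough-set positive region)."""
    if not attrs:
        return 0.0
    pos = sum(len(g) for g in partition(rows, attrs)
              if len({labels[i] for i in g}) == 1)
    return pos / len(rows)

def greedy_reduct(rows, labels, candidates):
    """Greedily add the attribute that most increases dependency until
    the reduct matches the dependency of the full candidate set."""
    target = dependency(rows, labels, candidates)
    reduct, current, remaining = [], 0.0, list(candidates)
    while current < target and remaining:
        best = max(remaining,
                   key=lambda a: dependency(rows, labels, reduct + [a]))
        reduct.append(best)
        remaining.remove(best)
        current = dependency(rows, labels, reduct)
    return reduct

def disjoint_reducts(rows, labels, n_attrs, max_reducts=3):
    """Compute up to max_reducts reducts sharing no attributes, by
    deleting each reduct's attributes from the candidate pool."""
    pool = list(range(n_attrs))
    full = dependency(rows, labels, pool)
    reducts = []
    while pool and len(reducts) < max_reducts:
        if dependency(rows, labels, pool) < full:
            break  # remaining attributes can no longer discern the classes
        r = greedy_reduct(rows, labels, pool)
        if not r:
            break
        reducts.append(r)
        pool = [a for a in pool if a not in r]
    return reducts
```

For example, on a toy table where attribute 0 and attribute 2 each determine the decision independently, the routine returns the two singleton reducts `[0]` and `[2]`, which share no attributes.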