A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization

Authors:
Jieming Yang;Yuanning Liu;Xiaodong Zhu;Zhen Liu;Xiaoxu Zhang
Affiliations:
College of Computer Science and Technology, Jilin University, Changchun, Jilin, China and School of Information Engineering, Northeast Dianli University, Jilin, Jilin, China;College of Computer Science and Technology, Jilin University, Changchun, Jilin, China;College of Computer Science and Technology, Jilin University, Changchun, Jilin, China;College of Computer Science and Technology, Jilin University, Changchun, Jilin, China and Graduate School of Engineering, Nagasaki Institute of Applied Science, Nagasaki-shi, Nagasaki, Japan;College of Computer Science and Technology, Jilin University, Changchun, Jilin, China
Venue:
Information Processing and Management: an International Journal
Year:
2012

Abstract

Feature selection, which can reduce the dimensionality of the vector space without sacrificing classifier performance, is widely used in text categorization. In this paper, we propose a new feature selection algorithm, named CMFS, which comprehensively measures the significance of a term both across categories (inter-category) and within each category (intra-category). We evaluate CMFS on three benchmark document collections (20-Newsgroups, Reuters-21578 and WebKB) using two classification algorithms, Naive Bayes (NB) and Support Vector Machines (SVMs). The experimental results, which compare CMFS with six well-known feature selection algorithms, show that CMFS is significantly superior to Information Gain (IG), the Chi statistic (CHI), Document Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and the DIA association factor (DIA) when the Naive Bayes classifier is used, and significantly outperforms IG, DF, OCFS and DIA when Support Vector Machines are used.
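
To make the idea of a combined inter-category and intra-category measure concrete, here is a minimal Python sketch. The abstract does not state the scoring formula, so everything in it is an assumption for illustration: the intra-category part is taken to be a Laplace-smoothed estimate of P(t|c), the inter-category part a smoothed estimate of P(c|t), the two are multiplied, and the global score of a term is its maximum over categories. The function names `cmfs_style_scores` and `select_top_k` are hypothetical.

```python
from collections import Counter, defaultdict


def cmfs_style_scores(docs, labels):
    """Score terms by combining an intra- and an inter-category estimate.

    docs: list of token lists; labels: category label per document.
    Returns {term: score}. The product form and the max over categories
    are assumptions of this sketch, not taken from the abstract.
    """
    # Term frequency per category: tf[c][t].
    tf = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        tf[label].update(tokens)

    categories = list(tf)
    vocab = {t for c in categories for t in tf[c]}
    total_per_cat = {c: sum(tf[c].values()) for c in categories}
    total_per_term = Counter()
    for c in categories:
        total_per_term.update(tf[c])

    scores = {}
    for t in vocab:
        best = 0.0
        for c in categories:
            # Intra-category: how prominent t is inside c (smoothed P(t|c)).
            p_t_given_c = (tf[c][t] + 1) / (total_per_cat[c] + len(vocab))
            # Inter-category: how concentrated t is in c (smoothed P(c|t)).
            p_c_given_t = (tf[c][t] + 1) / (total_per_term[t] + len(categories))
            best = max(best, p_t_given_c * p_c_given_t)
        scores[t] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Under these assumptions, a term that is frequent in every category (such as a stop word) is penalised by the inter-category factor, while a term that is rare within its category is penalised by the intra-category factor. The selected terms would then be used as the feature set for a classifier such as Naive Bayes or an SVM.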