Evolutionary instance selection for text classification

Authors:
Chih-Fong Tsai;Zong-Yao Chen;Shih-Wen Ke
Venue:
Journal of Systems and Software
Year:
2014

Abstract

Text classification usually relies on a model learned from training examples to classify text documents automatically. However, as text document repositories grow rapidly, the storage requirements and computational cost of model learning increase. Instance selection addresses these limitations by reducing the data size, filtering noisy data out of a given training dataset. In this paper, we introduce a novel algorithm for this task, a biological-based genetic algorithm (BGA). BGA fits "biological evolution" into the evolutionary process, where the most streamlined process also complies with reasonable rules. In other words, after long-term evolution, organisms find the most efficient way to allocate resources and evolve. By closely simulating natural evolution, the algorithm can therefore be both efficient and effective. Experimental results on the TechTC-100 and Reuters-21578 datasets show that BGA outperforms five state-of-the-art algorithms. In particular, selecting text documents with BGA yields the largest dataset reduction rate and requires the least computational time. Moreover, with BGA the k-NN and SVM classifiers achieve similar or slightly better classification accuracy than with GA.
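
The abstract does not spell out BGA's operators, so the following is only a minimal sketch of the general idea it builds on: a plain genetic algorithm that selects training instances for a nearest-neighbor classifier. It is not the authors' BGA. The bit-mask encoding, the fitness weighting `alpha`, the tournament selection, the one-point crossover, and the toy two-cluster data are all assumptions made for illustration.

```python
import random

def nn_accuracy(mask, train, valid):
    """1-NN accuracy on `valid` using only the training instances kept by `mask`."""
    subset = [inst for inst, keep in zip(train, mask) if keep]
    if not subset:
        return 0.0
    correct = 0
    for x, label in valid:
        # Nearest kept instance by squared Euclidean distance.
        nearest = min(subset, key=lambda inst: sum((a - b) ** 2 for a, b in zip(inst[0], x)))
        correct += nearest[1] == label
    return correct / len(valid)

def fitness(mask, train, valid, alpha=0.5):
    """Assumed trade-off: weighted sum of accuracy and dataset reduction rate."""
    reduction = 1.0 - sum(mask) / len(mask)
    return alpha * nn_accuracy(mask, train, valid) + (1 - alpha) * reduction

def select_instances(train, valid, pop_size=20, generations=30, p_mut=0.02, seed=0):
    """Generic GA over bit masks; bit i == 1 means training instance i is kept."""
    rng = random.Random(seed)
    n = len(train)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(((fitness(m, train, valid), m) for m in pop),
                        key=lambda t: t[0], reverse=True)
        next_pop = [m for _, m in scored[:2]]  # elitism: carry over the two best masks
        while len(next_pop) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            p1 = max(rng.sample(scored, 3), key=lambda t: t[0])[1]
            p2 = max(rng.sample(scored, 3), key=lambda t: t[0])[1]
            cut = rng.randrange(1, n)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with probability p_mut per gene.
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=lambda m: fitness(m, train, valid))

def make_toy_data(rng, n_per_class):
    """Two well-separated 2-D Gaussian clusters standing in for document vectors."""
    data = []
    for label, center in ((0, (0.0, 0.0)), (1, (5.0, 5.0))):
        for _ in range(n_per_class):
            data.append(((rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)), label))
    return data

data_rng = random.Random(42)
train = make_toy_data(data_rng, 20)
valid = make_toy_data(data_rng, 10)
best = select_instances(train, valid)
print(f"kept {sum(best)} of {len(train)} training instances, "
      f"1-NN accuracy {nn_accuracy(best, train, valid):.2f}")
```

In a text setting, the 2-D points would be replaced by document feature vectors (for example, TF-IDF), and the fitness would be evaluated with whichever classifier is being studied. How accuracy and reduction are balanced is a design choice of this sketch, not something the abstract specifies.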