On the utility of incremental feature selection for the classification of textual data streams

Authors:
Ioannis Katakis; Grigorios Tsoumakas; Ioannis Vlahavas
Affiliations:
Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece (all three authors)
Venue:
PCI'05 Proceedings of the 10th Panhellenic conference on Advances in Informatics
Year:
2005

Abstract

In this paper we argue that incrementally updating the features that a text classification algorithm considers is very important for real-world textual data streams, because in most applications the distribution of the data and the description of the classification concept change over time. To deal with this problem, we propose coupling an incremental feature ranking method with an incremental learning algorithm that can consider different subsets of the feature vector during prediction (what we call a feature-based classifier). Experimental results on a longitudinal database of real spam and legitimate emails show that our approach can adapt to the changing nature of streaming data and works much better than classical incremental learning algorithms.
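
The coupling described in the abstract can be illustrated with a minimal sketch. The abstract names neither the ranking metric nor the learner, so the chi-squared score, the Bernoulli-style naive Bayes model, the class name `FeatureBasedNB`, and the parameter `k` below are all assumptions chosen for illustration, not the authors' implementation. The point it shows is that statistics are kept for every word seen so far, while prediction uses only the words that currently rank highest, so the selected subset can change as the stream drifts without retraining.

```python
import math
from collections import defaultdict


class FeatureBasedNB:
    """Hypothetical sketch: incremental naive Bayes that stores counts for
    every word seen, but predicts with only the current top-k ranked words."""

    def __init__(self, k=2):
        self.k = k  # assumed size of the feature subset used at prediction time
        self.class_docs = defaultdict(int)  # class -> number of documents
        # word -> class -> number of documents of that class containing the word
        self.word_docs = defaultdict(lambda: defaultdict(int))

    def update(self, words, label):
        """Incrementally absorb one labelled document (no retraining)."""
        self.class_docs[label] += 1
        for w in sorted(set(words)):
            self.word_docs[w][label] += 1

    def chi2(self, word):
        """Chi-squared score of a known word, maximised over one-vs-rest classes."""
        n = sum(self.class_docs.values())
        n_w = sum(self.word_docs[word].values())
        best = 0.0
        for label, n_c in self.class_docs.items():
            a = self.word_docs[word].get(label, 0)  # in class, word present
            b = n_w - a                             # other classes, word present
            c = n_c - a                             # in class, word absent
            d = n - n_w - c                         # other classes, word absent
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        return best

    def top_features(self):
        """Re-rank the whole vocabulary and return the k best words right now."""
        return sorted(self.word_docs, key=self.chi2, reverse=True)[: self.k]

    def predict(self, words):
        """Classify using only the currently selected feature subset."""
        present = set(words)
        selected = self.top_features()
        n = sum(self.class_docs.values())
        scores = {}
        for label, n_c in self.class_docs.items():
            log_p = math.log(n_c / n)
            for w in selected:
                # Laplace-smoothed estimate of P(word present | class)
                p = (self.word_docs[w].get(label, 0) + 1) / (n_c + 2)
                log_p += math.log(p if w in present else 1 - p)
            scores[label] = log_p
        return max(scores, key=scores.get)


# Toy stream: spam vocabulary drifts from "pills" to "loans".
clf = FeatureBasedNB(k=2)
for _ in range(3):
    clf.update(["the", "pills"], "spam")
    clf.update(["the", "meeting"], "ham")
features_before = clf.top_features()
for _ in range(6):
    clf.update(["the", "loans"], "spam")
    clf.update(["the", "meeting"], "ham")
features_after = clf.top_features()
prediction = clf.predict(["the", "loans"])
```

In this toy stream the word "loans" is unseen at first and enters the selected subset only after the drift, while "pills" drops out. A classifier trained on a fixed, initially selected feature set could not use "loans" at all, which is the limitation the abstract attributes to classical incremental learners.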