Interactions between document representation and feature selection in text categorization

Authors:
Miloš Radovanović;Mirjana Ivanović
Affiliations:
Department of Mathematics and Informatics, University of Novi Sad, Faculty of Science, Novi Sad, Serbia and Montenegro;Department of Mathematics and Informatics, University of Novi Sad, Faculty of Science, Novi Sad, Serbia and Montenegro
Venue:
DEXA'06 Proceedings of the 17th international conference on Database and Expert Systems Applications
Year:
2006

Abstract

Many studies in automated Text Categorization focus on the performance of classifiers, with or without considering feature selection methods, but almost as a rule take into account just one document representation. Only relatively recently have detailed studies examined the impact of various document representations, showing that there may be statistically significant differences in classifier performance even among variations of the classical bag-of-words model. This paper examines the relationship between the idf transform and several widely used feature selection methods, in the context of Naïve Bayes and Support Vector Machine classifiers, on datasets extracted from the dmoz ontology of Web-page descriptions. The experimental study shows that the idf transform considerably affects the distribution of classification performance over feature selection reduction rates. It also offers an evaluation method that permits the discovery of relationships between different document representations and feature selection methods, independently of absolute differences in classification performance.