Converting numerical classification into text classification

Authors:
Sofus A. Macskassy;Haym Hirsh;Arunava Banerjee;Aynur A. Dayanik
Affiliations:
Department of Computer Science, Rutgers University, 110 Frelinghuysen Rd, Piscataway, NJ;Department of Computer Science, Rutgers University, 110 Frelinghuysen Rd, Piscataway, NJ;Brain and Cognitive Sciences, University of Rochester, Rochester, NY and Department of Computer Science, Rutgers University, 110 Frelinghuysen Rd, Piscataway, NJ;Department of Computer Science, Rutgers University, 110 Frelinghuysen Rd, Piscataway, NJ
Venue:
Artificial Intelligence
Year:
2003

Abstract

Consider a supervised learning problem in which examples contain both numerical- and text-valued features. To use traditional feature-vector-based learning methods, one could treat the presence or absence of a word as a Boolean feature and use these binary-valued features together with the numerical features. Applying a text-classification system to such data is more problematic: in the most straightforward approach, each number would be considered a distinct token and treated as a word. This paper presents an alternative approach to using text classification methods for supervised learning problems with numerical-valued features, in which the numerical features are converted into bag-of-words features, thereby making them directly usable by text classification methods. We show that, even on purely numerical-valued data, text classification on the derived text-like representation outperforms the more naive numbers-as-tokens representation and, more importantly, is competitive with mature numerical classification methods such as C4.5, Ripper, and SVM. We further show that, on mixed-mode data, adding numerical features using our approach can improve performance over not adding those features.
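The abstract does not spell out how a number becomes a bag of words, so the following is only an illustrative sketch of one plausible encoding, not necessarily the scheme the paper uses. It assumes a set of cut points per feature (hand-picked here; in practice they might come from a discretization step) and emits one token per cut point stating which side of it the value falls on. The function names `naive_tokens`, `threshold_tokens`, and `to_document`, the token format, and the `age` feature are all hypothetical.

```python
def naive_tokens(name, value):
    """Numbers-as-tokens baseline: every distinct value is its own word."""
    return [f"{name}={value}"]


def threshold_tokens(name, value, cut_points):
    """Assumed encoding: one token per cut point, recording whether the
    value is at or below ("le") or above ("gt") that cut point."""
    tokens = []
    for cut in cut_points:
        side = "le" if value <= cut else "gt"
        tokens.append(f"{name}_{side}_{cut}")
    return tokens


def to_document(words, numeric, cut_points):
    """Mixed-mode example: ordinary words plus tokens derived from numbers.

    `numeric` maps feature name -> value; `cut_points` maps feature
    name -> list of cut points (both hypothetical inputs).
    """
    doc = list(words)
    for name, value in numeric.items():
        doc.extend(threshold_tokens(name, value, cut_points[name]))
    return doc


cuts = [10, 20, 30]  # hand-picked for illustration only

# Nearby values get identical token sets under the threshold encoding...
a = threshold_tokens("age", 18, cuts)
b = threshold_tokens("age", 19, cuts)

# ...but share nothing under the naive numbers-as-tokens encoding.
na = naive_tokens("age", 18)
nb = naive_tokens("age", 19)

doc = to_document(["cheap", "flights"], {"age": 18}, {"age": cuts})
```

Under this assumed encoding, the values 18 and 19 produce the same tokens, whereas the naive representation gives a bag-of-words learner no way to relate them. That difference is one intuition for why a derived text-like representation might outperform numbers-as-tokens.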