Ambiguity measure feature-selection algorithm

Authors:
Saket S. R. Mengle; Nazli Goharian
Affiliations:
Information Retrieval Lab, Illinois Institute of Technology, Chicago, IL 60616; Information Retrieval Lab, Illinois Institute of Technology, Chicago, IL 60616
Venue:
Journal of the American Society for Information Science and Technology
Year:
2009

Abstract

As the number of digital documents grows, classifying them automatically, both efficiently and accurately, becomes more critical and more difficult. One of the major problems in text classification is the high dimensionality of the feature space. We present the ambiguity measure (AM) feature-selection algorithm, which selects the most unambiguous features from the feature set. Unambiguous features are those whose presence in a document indicates, with a strong degree of confidence, that the document belongs to only one specific category. We apply AM feature selection to a naïve Bayes text classifier and show that it outperforms eight existing feature-selection methods on five benchmark datasets, with statistical significance at a confidence level of at least 95%. The support vector machine (SVM) text classifier is shown to perform consistently better than the naïve Bayes text classifier; its drawback, however, is the time complexity of training a model. We further explore the effect of using the AM feature-selection method with an SVM text classifier. Our results indicate that the training time for the SVM algorithm can be reduced by more than 50% while still improving the accuracy of the text classifier. With the SVM classifier, AM statistically significantly (99% confidence) outperforms eight existing feature-selection methods on four standard benchmark datasets. © 2009 Wiley Periodicals, Inc.
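
The idea of an "unambiguous" feature can be illustrated with a minimal sketch. The abstract does not state the scoring formula, so the code below assumes one plausible reading: a term's score is the largest share of its occurrences that falls in any single category, and terms at or above a cutoff are kept. The function names, the `threshold` value, and the toy corpus are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def ambiguity_scores(docs):
    """Score each term by how concentrated it is in one category.

    docs: iterable of (tokens, category) pairs.
    Assumed scoring (not taken from the abstract): for each term, the
    maximum over categories of count(term, category) / count(term).
    A score of 1.0 means the term occurs in a single category only.
    """
    per_term = defaultdict(Counter)  # term -> {category: occurrence count}
    for tokens, category in docs:
        for tok in tokens:
            per_term[tok][category] += 1
    return {
        term: max(counts.values()) / sum(counts.values())
        for term, counts in per_term.items()
    }


def select_features(docs, threshold=0.8):
    """Keep terms whose score meets the (hypothetical) threshold."""
    scores = ambiguity_scores(docs)
    return {term for term, score in scores.items() if score >= threshold}


# Toy corpus: "team" appears in both categories, so it is ambiguous.
docs = [
    (["goal", "match", "team"], "sports"),
    (["team", "election", "vote"], "politics"),
    (["goal", "striker"], "sports"),
]

selected = select_features(docs)
```

In this toy corpus, "team" is split evenly between the two categories and falls below the cutoff, while "goal" occurs only in sports documents and is retained. The reduced feature set would then be passed to a classifier such as naïve Bayes or an SVM.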