Improving Farsi multiclass text classification using a thesaurus and two-stage feature selection

Authors:
Nooshin Maghsoodi;Mohammad Mehdi Homayounpour
Affiliations:
Laboratory of Intelligent Signal and Speech Processing, Faculty of Computer Engineering, Amirkabir University of Technology, Tehran, Iran;Laboratory of Intelligent Signal and Speech Processing, Faculty of Computer Engineering, Amirkabir University of Technology, Tehran, Iran
Venue:
Journal of the American Society for Information Science and Technology
Year:
2011

Citing 0
Cited 1

Using micro-documents for feature selection: The case of ordinal text classification

Expert Systems with Applications: An International Journal

Abstract

The steady growth in the volume of available information has made automatic document classification a necessity. This article presents a system for multiclass categorization of Farsi documents that requires fewer training examples and can help compensate for the shortcomings of the standard training dataset. The proposed idea is to extend the feature vector with words extracted from a thesaurus and then to filter the extended vector through a secondary feature selection step that discards inappropriate features. This secondary step chooses the more appropriate features among those added from the thesaurus, which strengthens the effect of the thesaurus on classifier efficiency. To evaluate the proposed system, a corpus was gathered from the Farsi Wikipedia website and from articles in the Hamshahri newspaper, the Roshd periodical, and the Soroush magazine. In addition to the role of the thesaurus and of secondary feature selection, the effects of the number of categories, the size of the training dataset, and the average number of words in the test data are examined. The results indicate that this approach improves classification efficiency, especially when the available data are insufficient for some text categories. © 2011 Wiley Periodicals, Inc.
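
The pipeline described in the abstract (primary feature selection, thesaurus-based extension, then secondary selection over the added words) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the selection metric, so a chi-square score is assumed here; the function names, the parameters k1 and k2, the English toy tokens, and the toy thesaurus are all hypothetical.

```python
from collections import Counter

def chi_square_scores(docs, labels):
    """Score each term by its maximum chi-square over classes (binary presence).
    ASSUMPTION: chi-square is a stand-in; the abstract does not name the metric."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    class_count = Counter(labels)
    scores = {}
    for term in set().union(*doc_sets):
        best = 0.0
        for cls in class_count:
            a = sum(1 for s, y in zip(doc_sets, labels) if term in s and y == cls)
            b = sum(1 for s, y in zip(doc_sets, labels) if term in s and y != cls)
            c = class_count[cls] - a   # documents of cls without the term
            d = n - a - b - c          # other documents without the term
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores

def top_k(scores, k, candidates=None):
    """Return the k best-scoring terms; ties are broken alphabetically."""
    pool = scores if candidates is None else candidates
    return sorted(pool, key=lambda t: (-scores.get(t, 0.0), t))[:k]

def two_stage_features(docs, labels, thesaurus, k1, k2):
    """Hypothetical two-stage selection: pick k1 primary features, extend each
    document with thesaurus words related to them, keep the k2 best added words."""
    primary = top_k(chi_square_scores(docs, labels), k1)
    primary_set = set(primary)
    expanded, added = [], set()
    for doc in docs:
        related = {w for t in doc if t in primary_set for w in thesaurus.get(t, [])}
        related -= primary_set
        added |= related
        expanded.append(list(doc) + sorted(related))
    # Secondary selection: re-score on the extended documents, added words only.
    secondary = top_k(chi_square_scores(expanded, labels), k2, candidates=added)
    return primary, secondary

# Toy data (hypothetical; English tokens stand in for Farsi words).
docs = [["football", "goal"], ["football", "match"],
        ["election", "vote"], ["election", "party"]]
labels = ["sport", "sport", "politics", "politics"]
thesaurus = {"football": ["soccer", "game"], "election": ["ballot", "game"]}

primary, secondary = two_stage_features(docs, labels, thesaurus, k1=2, k2=2)
print(primary)    # ['election', 'football']
print(secondary)  # ['ballot', 'soccer']
```

In this toy run the thesaurus word "game" is related to features from both categories, so it carries no class information and the secondary step discards it, which is the kind of inappropriate added feature the abstract says the second stage is meant to filter out.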