Using thesaurus to improve multiclass text classification

  • Authors:
  • Nooshin Maghsoodi;Mohammad Mehdi Homayounpour

  • Affiliations:
  • Laboratory of Intelligent Signal and Speech Processing, Faculty of Computer Engineering, Amirkabir University of Technology, Tehran, Iran;Laboratory of Intelligent Signal and Speech Processing, Faculty of Computer Engineering, Amirkabir University of Technology, Tehran, Iran

  • Venue:
  • CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part II
  • Year:
  • 2011

Abstract

With the growing amount of textual information available on the Internet, the importance of automatic text classification has increased over the last decade. This paper presents a system for multi-class classification of Farsi documents that uses a Support Vector Machine (SVM) classifier. The new idea proposed in this paper is to extend the feature vector with words extracted from a thesaurus. The goal is to assist the classifier when the training dataset is not comprehensive for some categories. For corpus preparation, the Farsi Wikipedia website and articles from archived newspapers and magazines are used. The results indicate that classification efficiency improves with this approach: a micro F-measure of 0.89 was achieved for the classification of Farsi texts into 10 categories.
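
The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it names: extending a bag-of-words feature vector with thesaurus words, and scoring predictions with a micro-averaged F-measure. The toy English thesaurus, the function names `expand_features` and `micro_f1`, and the `weight` parameter that down-weights added words are all assumptions made for illustration. They are not taken from the paper, and the SVM training step itself is omitted.

```python
from collections import Counter

# Hypothetical toy thesaurus (assumption); the paper uses a Farsi thesaurus
# whose contents are not given in the abstract.
THESAURUS = {
    "football": ["soccer", "sport"],
    "election": ["vote", "politics"],
}


def expand_features(tokens, thesaurus, weight=0.5):
    """Build a bag-of-words dict extended with thesaurus-related words.

    Each original token contributes 1.0. Each related word found in the
    thesaurus contributes `weight` (an assumed down-weighting, not a
    value from the paper).
    """
    features = Counter()
    for tok in tokens:
        features[tok] += 1.0
        for related in thesaurus.get(tok, []):
            features[related] += weight
    return dict(features)


def micro_f1(y_true, y_pred):
    """Micro-averaged F-measure: pool TP, FP and FN over all classes."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # A document mentioning "football" also receives weight on related words.
    print(expand_features(["football", "match"], THESAURUS))
    # Pooled over classes: 3 true positives, 1 false positive, 1 false negative.
    print(micro_f1(["a", "b", "c", "a"], ["a", "b", "a", "a"]))
```

In a full pipeline, the expanded dictionaries would be turned into fixed-length vectors and passed to an SVM classifier. For single-label multi-class data, where every document receives exactly one predicted category, the micro-averaged F-measure computed this way equals plain accuracy.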