Using thesaurus to improve multiclass text classification

Authors:
Nooshin Maghsoodi;Mohammad Mehdi Homayounpour
Affiliations:
Laboratory of Intelligent Signal and Speech Processing, Faculty of Computer Engineering, Amirkabir University of Technology, Tehran, Iran;Laboratory of Intelligent Signal and Speech Processing, Faculty of Computer Engineering, Amirkabir University of Technology, Tehran, Iran
Venue:
CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part II
Year:
2011

Abstract

With the growing amount of textual information available on the Internet, the importance of automatic text classification has increased over the last decade. This paper presents a system for multi-class classification of Farsi documents that uses a Support Vector Machine (SVM) classifier. The new idea proposed here is to extend the feature vector with words extracted from a thesaurus. The goal is to assist the classifier when the training dataset is not comprehensive for some categories. The corpus was prepared from the Farsi Wikipedia website and from articles in archived newspapers and magazines. The results indicate that classification efficiency improves when this approach is applied: a micro F-measure of 0.89 was achieved for the classification of Farsi texts into 10 categories.
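
The two technical ideas in the abstract, thesaurus-based feature extension and the micro F-measure, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say which thesaurus was used, how related words were selected or weighted, or how the SVM was configured. The toy English thesaurus, the function names (`expand_tokens`, `term_frequencies`, `micro_f_measure`), and the plain term-frequency weighting are all assumptions made for illustration. The SVM training step is left out.

```python
from collections import Counter

# Hypothetical toy thesaurus (an assumption): maps a word to related words.
# The abstract does not describe the actual Farsi thesaurus or its selection rules.
TOY_THESAURUS = {
    "football": ["soccer", "sport"],
    "economy": ["market", "finance"],
}


def expand_tokens(tokens, thesaurus):
    """Return the original tokens followed by the thesaurus entries of each token."""
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(thesaurus.get(token, []))
    return expanded


def term_frequencies(tokens, vocabulary):
    """Build a term-frequency vector over a fixed, ordered vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def micro_f_measure(true_labels, predicted_labels):
    """Micro-averaged F-measure: pool TP, FP and FN over all classes first."""
    classes = set(true_labels) | set(predicted_labels)
    tp = fp = fn = 0
    for cls in classes:
        for truth, pred in zip(true_labels, predicted_labels):
            if pred == cls and truth == cls:
                tp += 1
            elif pred == cls:
                fp += 1
            elif truth == cls:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    vocabulary = ["football", "match", "soccer", "sport", "market"]
    document = ["football", "match"]
    expanded = expand_tokens(document, TOY_THESAURUS)
    print(term_frequencies(document, vocabulary))
    print(term_frequencies(expanded, vocabulary))
    print(micro_f_measure(["sport", "economy", "sport"], ["sport", "sport", "sport"]))
```

In this sketch the expanded document gains nonzero entries for "soccer" and "sport", which is how a sparsely covered category could overlap with test documents that use different vocabulary. For single-label multi-class data, the micro F-measure computed this way equals plain accuracy.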