Persian text classification based on K-NN using wordnet

Authors:
Mostafa Parchami;Bahareh Akhtar;MirHossein Dezfoulian
Affiliations:
Department of Computer Engineering, Bu-Ali Sina University, Hamedan, Iran;Department of Computer Engineering, Bu-Ali Sina University, Hamedan, Iran;Department of Computer Engineering, Bu-Ali Sina University, Hamedan, Iran
Venue:
IEA/AIE'12 Proceedings of the 25th international conference on Industrial Engineering and Other Applications of Applied Intelligent Systems: advanced research in applied artificial intelligence
Year:
2012

Abstract

K-NN is widely used for text classification. Basic K-NN on its own has poor accuracy, so additional methods should be applied to it to improve accuracy and efficiency. In this paper we propose a method that uses WordNet to increase the similarity of documents in the same category. Documents are represented by single words and their frequencies; using WordNet, the frequencies of related words are adjusted to achieve higher accuracy. Information gain is used to eliminate terms that are not discriminative. Words like "and", "or" and "that" in English are not important in text classification, and the best way to eliminate them is to calculate their information gain. PCA is used to reduce the number of features and increase the speed of the method. Applying this method, we designed a faster and more accurate classifier for the Persian language. Experiments show that applying this preprocessing increases both the accuracy and the speed of K-NN. The accuracy of the proposed K-NN classifier on the Hamshahri corpus is 88.18%.
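The preprocessing steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how related-word frequencies are changed, so the `expand_with_related` function, its `weight` parameter, the `related` dictionary standing in for WordNet lookups, the cosine similarity measure, and the toy English documents are all assumptions made for illustration. The PCA step is omitted to keep the example dependency-free.

```python
import math
from collections import Counter

def expand_with_related(counts, related, weight=0.5):
    """Hypothetical scheme: add a fraction of each word's count to its related words."""
    expanded = Counter(counts)
    for word, freq in counts.items():
        for rel in related.get(word, ()):
            expanded[rel] += weight * freq
    return expanded

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(term, docs, labels):
    """Information gain of the presence/absence of a term with respect to the labels."""
    with_term = [y for d, y in zip(docs, labels) if d.get(term, 0) > 0]
    without_term = [y for d, y in zip(docs, labels) if d.get(term, 0) == 0]
    gain = entropy(labels)
    for part in (with_term, without_term):
        if part:
            gain -= len(part) / len(labels) * entropy(part)
    return gain

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors (assumed measure)."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(query, docs, labels, k=3):
    """Majority vote among the k training documents most similar to the query."""
    ranked = sorted(zip(docs, labels), key=lambda p: cosine(query, p[0]), reverse=True)
    votes = Counter(y for _, y in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data (illustrative only; English words stand in for Persian terms).
raw_docs = [
    Counter({"football": 2, "match": 1, "and": 3}),
    Counter({"soccer": 1, "goal": 2, "and": 2}),
    Counter({"election": 2, "vote": 1, "and": 3}),
    Counter({"parliament": 1, "vote": 2, "and": 1}),
]
labels = ["sport", "sport", "politics", "politics"]
related = {"football": ["soccer"], "soccer": ["football"]}  # stand-in for WordNet

# Step 1: adjust frequencies of related words.
docs = [expand_with_related(d, related) for d in raw_docs]

# Step 2: keep only terms with positive information gain ("and" scores zero here).
vocabulary = set().union(*docs)
kept = {t for t in vocabulary if information_gain(t, docs, labels) > 1e-9}
docs = [Counter({t: f for t, f in d.items() if t in kept}) for d in docs]

# Step 3: classify a new document with K-NN.
query = expand_with_related(Counter({"soccer": 1, "match": 1, "and": 2}), related)
query = Counter({t: f for t, f in query.items() if t in kept})
prediction = knn_predict(query, docs, labels, k=3)
```

In this toy setup the word "and" occurs in every document, so its presence carries no class information and it is filtered out, while "vote" separates the two classes perfectly and is kept. The threshold of zero is an arbitrary choice for the example; a real system would tune it on held-out data.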