A text categorization method based on local document frequency

Authors:
Feng Xia;Tian Jicun;Liu Zhihui
Affiliations:
School of Computer Science and Technology, Civil Aviation University of China, Tianjin, P.R.China;School of Computer Science and Technology, Civil Aviation University of China, Tianjin, P.R.China;School of Computer Science and Technology, Civil Aviation University of China, Tianjin, P.R.China
Venue:
FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 7
Year:
2009

Abstract

In this paper, a fast and effective text categorization method named TCBLDF is proposed. TCBLDF requires little dimensionality reduction beyond stop-word removal and document-frequency-based feature selection. It tries to capture the relationship between a term and a category label, which eliminates the need to know the semantic contribution a term makes to the document in which it occurs. TCBLDF uses a measure to evaluate the importance of each term for the categorization task and then weights each term according to that evaluation, so that important terms have more influence on the classification decision. Finally, we compare the method with two conventional classification methods: Naive Bayes and a linear SVM. Experimental results show that TCBLDF is faster than the SVM with comparable performance and more effective than Naive Bayes, and can therefore be a good alternative to these methods.
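The abstract does not state the term-importance measure used by TCBLDF, so the following Python sketch covers only the two preprocessing steps it names (stop-word removal and document-frequency-based feature selection), plus a per-category document-frequency count of the kind the paper's title suggests. The stop-word list, the function names, the `min_df` threshold, and the toy corpus are all assumptions made for illustration; this is not the authors' implementation.

```python
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def select_by_document_frequency(docs, min_df=2):
    """Keep terms that occur in at least `min_df` documents."""
    df = Counter()
    for text in docs:
        df.update(set(tokenize(text)))  # count each term once per document
    return {term for term, n in df.items() if n >= min_df}


def local_document_frequency(docs, labels, vocabulary):
    """For each category, count the documents that contain each kept term."""
    local_df = defaultdict(Counter)
    for text, label in zip(docs, labels):
        local_df[label].update(set(tokenize(text)) & vocabulary)
    return local_df


# Hypothetical toy corpus with two categories.
docs = [
    "the flight was delayed",
    "flight delayed again",
    "the match ended in a draw",
    "a late goal decided the match",
]
labels = ["travel", "travel", "sport", "sport"]

vocab = select_by_document_frequency(docs, min_df=2)
local_df = local_document_frequency(docs, labels, vocab)
```

Counts like these relate each term to a category label without modelling what the term contributes to an individual document. How TCBLDF turns such statistics into term weights and a classification decision is not described in the abstract and is left out here.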