Fast text categorization using concise semantic analysis

Authors:
Zhixing Li;Zhongyang Xiong;Yufang Zhang;Chunyong Liu;Kuan Li
Affiliations:
Department of Computer Science, Chongqing University, China;Department of Computer Science, Chongqing University, China;Department of Computer Science, Chongqing University, China;Department of Computer Science, Chongqing University, China;Department of Computer Science, Chongqing University, China
Venue:
Pattern Recognition Letters
Year:
2011


Abstract

Text representation is a necessary step in text categorization. Bag of words (BOW) is currently the most widely used text representation method, but it suffers from two drawbacks: the vocabulary is huge, and relationships between words cannot feasibly be computed. Semantic analysis (SA) techniques help BOW overcome both drawbacks by interpreting words and documents in a space of concepts. However, existing SA techniques are not designed for text categorization and often incur a huge computing cost. This paper proposes a concise semantic analysis (CSA) technique for text categorization tasks. CSA extracts a few concepts from category labels and then interprets words and documents concisely in terms of those concepts. These concepts are few in number, highly general, and tightly related to the category labels. CSA therefore preserves the information that classifiers need at very low computing cost. To evaluate CSA, experiments were conducted on three data sets (Reuters-21578, 20-NewsGroup and Tancorp). The results show that CSA achieves micro- and macro-F1 performance comparable to that of BOW, if not better. Experiments also show that CSA helps dimension-sensitive learning algorithms such as k-nearest neighbor (kNN) overcome the "curse of dimensionality" and, as a result, reach performance comparable to that of support vector machines (SVM) in text categorization applications. In addition, CSA is language independent and performs equally well in Chinese and English.
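
The abstract does not give CSA's weighting formulas, so the following is only a minimal sketch of the general idea, under stated assumptions: each category label is treated as one concept, a word is interpreted as an L2-normalized vector of its term counts per category, and a document is interpreted as the count-weighted sum of its word vectors. The function names (word_concept_vectors, interpret_document) and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
import math


def word_concept_vectors(docs, labels):
    """Interpret each word as a vector over concepts (one concept per category).

    Assumption: the weight of a word for a concept is its raw term count in
    the training documents of that category, followed by L2 normalization.
    The paper's actual weighting scheme may differ.
    """
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    vectors = defaultdict(lambda: [0.0] * len(categories))
    for tokens, label in zip(docs, labels):
        for word, count in Counter(tokens).items():
            vectors[word][index[label]] += count
    normalized = {}
    for word, vec in vectors.items():
        norm = math.sqrt(sum(v * v for v in vec))
        normalized[word] = [v / norm for v in vec]
    return categories, normalized


def interpret_document(tokens, vectors, n_concepts):
    """Interpret a document as the count-weighted sum of its word vectors.

    Words never seen in training are skipped.
    """
    doc = [0.0] * n_concepts
    for word, count in Counter(tokens).items():
        vec = vectors.get(word)
        if vec is None:
            continue
        for i, v in enumerate(vec):
            doc[i] += count * v
    return doc


# Toy training set (hypothetical): tokenized documents and their categories.
train_docs = [
    ["oil", "price", "rise"],
    ["oil", "export"],
    ["match", "goal"],
    ["goal", "team", "win"],
]
train_labels = ["energy", "energy", "sport", "sport"]

categories, vectors = word_concept_vectors(train_docs, train_labels)
doc_vector = interpret_document(
    ["oil", "goal", "goal", "unknown"], vectors, len(categories)
)
print(categories, doc_vector)
```

In this sketch the document vector has one dimension per category rather than one per vocabulary word, which is the property the abstract credits for the low computing cost and for making distance-based classifiers such as kNN practical.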