Text representation is a necessary step in text categorization. Bag of words (BOW) is currently the most widely used text representation method, but it suffers from two drawbacks. First, the number of words is huge; second, it is not feasible to calculate the relationships between words. Semantic analysis (SA) techniques help BOW overcome these two drawbacks by interpreting words and documents in a space of concepts. However, existing SA techniques are not designed for text categorization and often incur a huge computing cost. This paper proposes a concise semantic analysis (CSA) technique for text categorization tasks. CSA extracts a few concepts from category labels and then interprets words and documents concisely in terms of those concepts. The concepts are few in number, highly general, and tightly related to the category labels. CSA therefore preserves the information that classifiers need at very low computing cost. To evaluate CSA, experiments were conducted on three data sets (Reuters-21578, 20-NewsGroup and Tancorp). The results show that CSA reaches micro- and macro-F1 performance comparable to, if not better than, that of BOW. The experiments also show that CSA helps dimension-sensitive learning algorithms such as k-nearest neighbor (kNN) overcome the "curse of dimensionality" and, as a result, reach performance comparable to that of support vector machines (SVM) in text categorization applications. In addition, CSA is language independent and performs equally well in Chinese and English.
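The idea of interpreting words and documents in a small concept space derived from category labels can be illustrated with a minimal sketch. The abstract does not specify how CSA weights words or documents, so everything below is an assumption made for illustration: one concept per category label, word weights taken as the relative frequency of the word across categories, and document vectors built as normalized term-frequency-weighted sums. The function names `word_concept_vectors` and `document_vector` and the toy data are hypothetical and are not the paper's method.

```python
from collections import Counter, defaultdict


def word_concept_vectors(docs, labels):
    """Map each word to a vector with one weight per category label.

    Assumption: the weight is the share of the word's occurrences that
    fall in each category. The source does not state the actual formula.
    """
    concepts = sorted(set(labels))
    index = {c: i for i, c in enumerate(concepts)}
    counts = defaultdict(lambda: [0.0] * len(concepts))
    for tokens, label in zip(docs, labels):
        for word, tf in Counter(tokens).items():
            counts[word][index[label]] += tf
    vectors = {}
    for word, row in counts.items():
        total = sum(row)  # always > 0: every stored word occurred at least once
        vectors[word] = [v / total for v in row]
    return concepts, vectors


def document_vector(tokens, vectors, n_concepts):
    """Represent a document as a normalized sum of its words' concept vectors."""
    doc = [0.0] * n_concepts
    for word, tf in Counter(tokens).items():
        vec = vectors.get(word)
        if vec is None:
            continue  # words unseen in training carry no concept information
        for i, weight in enumerate(vec):
            doc[i] += tf * weight
    norm = sum(doc)
    return [v / norm for v in doc] if norm else doc


# Toy training data (hypothetical): tokenized documents and their labels.
train_docs = [
    ["oil", "price", "barrel"],
    ["oil", "export", "price"],
    ["wheat", "harvest", "export"],
]
train_labels = ["crude", "crude", "grain"]

concepts, vectors = word_concept_vectors(train_docs, train_labels)
doc = document_vector(["oil", "export", "tariff"], vectors, len(concepts))
print(concepts, doc)
```

In this sketch a document's dimensionality equals the number of categories rather than the vocabulary size, which is the property the abstract credits for helping dimension-sensitive learners such as kNN.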