Distributional Features for Text Categorization

Authors:
Xiao-Bing Xue;Zhi-Hua Zhou
Affiliations:
Nanjing University, Nanjing;Nanjing University, Nanjing
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2009

Abstract

Text categorization is the task of assigning predefined categories to natural language text. With the widely used "bag of words" representation, previous research usually assigns each word a value that expresses either whether the word appears in the document concerned or how frequently it appears. Although these values are useful for text categorization, they do not fully express the abundant information contained in the document. This paper explores the effect of other types of values, which express the distribution of a word in the document. These novel values assigned to a word are called distributional features, and they include the compactness of the appearances of the word and the position of the first appearance of the word. The proposed distributional features are exploited by a tfidf-style equation, and different features are combined using ensemble learning techniques. Experiments show that the distributional features are useful for text categorization. Compared with using the traditional term frequency values alone, including the distributional features requires only a little additional cost, while the categorization performance can be significantly improved. Further analysis shows that the distributional features are especially useful when documents are long and the writing style is casual.
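The abstract does not give formulas for the distributional features, so the following is only a minimal sketch of the idea under stated assumptions. It assumes a document is split into a fixed number of equal-sized contiguous parts, takes the index of the first part containing the word as a stand-in for "position of the first appearance", and takes the number of parts containing the word as a rough stand-in for "compactness" (fewer parts means more compact). The names `part_counts`, `distributional_features`, and `num_parts` are hypothetical and are not taken from the paper.

```python
def part_counts(tokens, word, num_parts):
    """Count occurrences of `word` in each of `num_parts` contiguous parts.

    Assumption: parts are equal-sized slices of the token sequence.
    """
    counts = [0] * num_parts
    n = len(tokens)
    for i, tok in enumerate(tokens):
        if tok == word:
            # Map token index i to a part index in [0, num_parts - 1].
            counts[i * num_parts // n] += 1
    return counts


def distributional_features(tokens, word, num_parts=4):
    """Return illustrative distributional values for `word`, or None if absent.

    These definitions are assumptions made for illustration only.
    """
    counts = part_counts(tokens, word, num_parts)
    occupied = [p for p, c in enumerate(counts) if c > 0]
    if not occupied:
        return None
    return {
        "term_frequency": sum(counts),      # traditional bag-of-words value
        "first_appearance": occupied[0],    # index of first part with the word
        "parts_occupied": len(occupied),    # fewer parts = more compact
    }


# Two toy documents in which "svm" has the same term frequency
# but a different distribution.
doc_a = "svm svm svm a b c d e f g h i".split()
doc_b = "a b c svm d e svm f g h i svm".split()

features_a = distributional_features(doc_a, "svm")
features_b = distributional_features(doc_b, "svm")
print(features_a)
print(features_b)
```

In this toy case both documents give "svm" a term frequency of 3, yet in `doc_a` the word is concentrated in the opening part while in `doc_b` it is spread over three later parts. That difference is the kind of information the abstract says term frequency alone does not capture.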