Boosting the Feature Space: Text Classification for Unstructured Data on the Web

Authors:
Yang Song;Ding Zhou;Jian Huang;Isaac G. Councill;Hongyuan Zha;C. Lee Giles
Affiliations:
The Pennsylvania State University, USA;The Pennsylvania State University, USA;The Pennsylvania State University, USA;The Pennsylvania State University, USA;The Pennsylvania State University, USA;The Pennsylvania State University, USA
Venue:
ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Year:
2006

Citing 0
Cited 4

Extraction of unexpected sentences: A sentiment classification assessed approach
Intelligent Data Analysis

Enterprise data classification using semantic web technologies
ISWC'10 Proceedings of the 9th international semantic web conference on The semantic web - Volume Part II

TAKES: a fast method to select features in the kernel space
Proceedings of the 20th ACM international conference on Information and knowledge management

Feature selection for link prediction
Proceedings of the 5th Ph.D. workshop on Information and knowledge

Abstract

Efficient and effective methods for classifying unstructured text in large document corpora have received much attention in recent years. Traditional document representations such as bag-of-words encode documents as feature vectors, which usually yields sparse, high-dimensional feature spaces and makes high classification accuracy hard to achieve. This paper addresses the problem of classifying unstructured documents on the Web. A classification approach is proposed that combines traditional feature reduction techniques with a collaborative filtering method for augmenting document feature spaces. The method produces feature spaces with an order of magnitude fewer features than a baseline bag-of-words feature selection method. Experiments on both real-world data and a benchmark corpus indicate that our approach improves classification accuracy over traditional methods for both Support Vector Machine and AdaBoost classifiers.
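
The sparsity problem described in the abstract can be illustrated with a minimal sketch. It builds bag-of-words vectors for a toy corpus and then applies a simple document-frequency cutoff as a stand-in for "traditional feature reduction". The corpus, the function names, and the min_df threshold are all assumptions made for illustration; the abstract does not say which reduction techniques the paper uses, and the collaborative filtering augmentation step is not sketched here because the abstract gives no details of it.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, vectors); each vector holds raw term counts."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for term, count in Counter(toks).items():
            vec[index[term]] = count
        vectors.append(vec)
    return vocab, vectors


def sparsity(vectors):
    """Fraction of zero entries across all vectors."""
    total = sum(len(v) for v in vectors)
    zeros = sum(1 for v in vectors for x in v if x == 0)
    return zeros / total


def select_by_document_frequency(vocab, vectors, min_df=2):
    """Keep only terms that occur in at least min_df documents.

    Hypothetical helper: a document-frequency cutoff is one common
    reduction technique, assumed here, not taken from the paper.
    """
    keep = [
        j for j in range(len(vocab))
        if sum(1 for v in vectors if v[j] > 0) >= min_df
    ]
    reduced_vocab = [vocab[j] for j in keep]
    reduced_vectors = [[v[j] for j in keep] for v in vectors]
    return reduced_vocab, reduced_vectors


# Toy corpus (illustrative only, not data from the paper).
docs = [
    "support vector machines classify text",
    "boosting improves text classifiers",
    "web pages contain unstructured text",
    "support vector machines use kernels",
]

vocab, vectors = bag_of_words(docs)
small_vocab, small_vectors = select_by_document_frequency(vocab, vectors)

print(len(vocab), "terms before selection, sparsity", sparsity(vectors))
print(len(small_vocab), "terms after selection, sparsity", sparsity(small_vectors))
```

Even on four short documents most entries in the full vectors are zero, and dropping rare terms shrinks the feature space considerably. A real pipeline would use a proper tokenizer and a sparse matrix type rather than Python lists.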