Enhancing text classification by information embedded in the test set

Authors:
Gabriela Ramírez-de-la-Rosa;Manuel Montes-y-Gómez;Luis Villaseñor-Pineda
Affiliations:
Laboratory of Language Technologies, National Institute of Astrophysics, Optics and Electronics, Pue., Mexico;Laboratory of Language Technologies, National Institute of Astrophysics, Optics and Electronics, Pue., Mexico;Laboratory of Language Technologies, National Institute of Astrophysics, Optics and Electronics, Pue., Mexico
Venue:
CICLing'10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing
Year:
2010

Abstract

Current text classification methods are mostly based on a supervised approach, which requires a large number of examples to build accurate models. Unfortunately, in several tasks training sets are extremely small and their generation is very expensive. To tackle this problem, in this paper we propose a new text classification method that takes advantage of the information embedded in the test set itself. The method is based on the idea that similar documents must belong to the same category. In particular, it classifies documents by considering not only their own content but also the categories assigned to other similar documents from the same test set. Experimental results on four data sets of different sizes are encouraging. They indicate that the proposed method is well suited to small training sets, where it could significantly outperform traditional approaches such as Naive Bayes and Support Vector Machines.
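
The idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring function, so everything below is an assumption made for illustration and not the authors' actual method: bag-of-words vectors, cosine similarity, class centroids built from the training set, and a hypothetical mixing weight `lam` that balances a document's own score against the similarity-weighted scores of its `k` nearest neighbours in the test set.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_with_neighbors(train, test, k=2, lam=0.5):
    """Label test documents using their own content plus their test-set neighbours.

    train: list of (text, label) pairs; test: list of texts.
    k and lam are hypothetical parameters, not taken from the paper.
    """
    # One centroid per class: the summed term counts of its training documents.
    centroids = defaultdict(Counter)
    for text, label in train:
        centroids[label].update(vectorize(text))
    labels = sorted(centroids)

    vecs = [vectorize(t) for t in test]
    # Score of each test document against each class, from its own content only.
    own = [{c: cosine(v, centroids[c]) for c in labels} for v in vecs]

    predictions = []
    for i, v in enumerate(vecs):
        # The k most similar *other* documents from the same test set.
        sims = sorted(
            ((cosine(v, u), j) for j, u in enumerate(vecs) if j != i),
            reverse=True,
        )[:k]
        total = sum(s for s, _ in sims)
        combined = {}
        for c in labels:
            # Similarity-weighted average of the neighbours' class scores.
            neigh = sum(s * own[j][c] for s, j in sims) / total if total else 0.0
            combined[c] = lam * own[i][c] + (1 - lam) * neigh
        predictions.append(max(labels, key=lambda c: combined[c]))
    return predictions


# Tiny training set: one example per class.
train = [("goal match team", "sports"), ("election vote party", "politics")]
test_docs = [
    "team goal striker penalty",
    "striker penalty referee",  # shares no terms with any training document
    "vote party senate",
    "senate bill debate vote",
]
predictions = classify_with_neighbors(train, test_docs)
print(predictions)  # ['sports', 'sports', 'politics', 'politics']
```

In this toy example the second test document shares no vocabulary with the training set, so its own content gives it a score of zero for both classes. It is labelled "sports" only because its nearest neighbour in the test set, the first document, scores well for that class, which is the kind of test-set information the abstract refers to.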