A new semi-supervised dimension reduction technique for textual data analysis

Authors:
Manuel Martín-Merino;Jesus Román
Affiliations:
Universidad Pontificia de Salamanca, Salamanca, Spain;Universidad Pontificia de Salamanca, Salamanca, Spain
Venue:
IDEAL'06 Proceedings of the 7th international conference on Intelligent Data Engineering and Automated Learning
Year:
2006



Abstract

Dimension reduction techniques are important preprocessing algorithms for high-dimensional applications: they reduce noise while keeping the main structure of the dataset. They have been successfully applied to a wide variety of problems, particularly in text mining. However, the algorithms proposed in the literature often suffer from low discriminant power because of their unsupervised nature and the ‘curse of dimensionality’. Fortunately, several search engines, such as Yahoo, provide a manually created classification of a subset of documents that can be exploited to overcome this problem. In this paper we propose a semi-supervised version of a PCA-like algorithm for textual data analysis. The new method reduces the dimensionality of the term space by taking advantage of this document classification. The proposed algorithm has been evaluated on a text mining problem, where it outperforms well-known unsupervised techniques.
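
The abstract does not give the details of the proposed algorithm, so the following is only an illustrative sketch of the general idea, not the authors' method. It contrasts plain PCA on a document-term matrix with one possible way of using a partial document classification: blending the total covariance with the between-class scatter of the labeled documents. The function names, the `alpha` weight, and the convention that `-1` marks an unlabeled document are all assumptions made for this example.

```python
import numpy as np

def pca_project(X, k):
    """Unsupervised baseline: project documents (rows of X) onto the top-k principal axes."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes in term space, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def semi_supervised_project(X, labels, k, alpha=0.5):
    """Hypothetical semi-supervised variant (an assumption, not the paper's algorithm).

    labels[i] is a class id for labeled documents and -1 for unlabeled ones.
    alpha in [0, 1] weights the class information; alpha=0 reduces to plain PCA.
    """
    Xc = X - X.mean(axis=0)
    total = Xc.T @ Xc / X.shape[0]           # covariance of all documents
    labeled = labels >= 0
    between = np.zeros_like(total)
    for c in np.unique(labels[labeled]):
        members = Xc[labels == c]
        mu = members.mean(axis=0)            # class mean relative to the global mean
        between += len(members) * np.outer(mu, mu)
    between /= max(int(labeled.sum()), 1)    # scatter of the labeled class means
    M = (1 - alpha) * total + alpha * between
    eigvals, eigvecs = np.linalg.eigh(M)     # M is symmetric
    axes = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ axes

# Toy usage: 6 documents, 4 terms, 4 of the documents manually classified.
rng = np.random.default_rng(0)
X_toy = rng.random((6, 4))
labels_toy = np.array([0, 0, -1, 1, 1, -1])
Z_toy = semi_supervised_project(X_toy, labels_toy, k=2)
```

With `alpha=0` the class information is ignored and the projection matches the PCA baseline up to the sign of each axis; larger values pull the retained axes toward directions that separate the labeled classes.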