Dimensionality reduction is an established area in text mining and information retrieval. Its methods convert highly sparse corpus matrices into a dense matrix format while preserving or improving classification accuracy or retrieval performance. In this paper, we describe a novel approach to dimensionality reduction for text, along with a parallel algorithm suitable for private-memory parallel computer systems. According to Zipf's law, the majority of indexing terms occur in only a small number of documents. Our algorithm replaces each rare term with a vector that expresses its semantics in terms of common terms. This process produces a projection matrix, which can be applied to a corpus matrix as well as to individual document and query vectors. We give a precise mathematical and algorithmic description of our algorithms and present an experimental evaluation on two benchmark corpora. These experiments indicate that our algorithm can deliver a substantial reduction in the number of features, from 47,236 to 392 on the Reuters corpus, with a clear improvement in retrieval performance. We have evaluated our parallel implementation using the Message Passing Interface (MPI) with up to 32 processes on a Nehalem Xeon cluster, computing the projection matrix for the dimensionality reduction of over 800,000 documents in just under 100 s.
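The abstract does not state how the replacement vector for a rare term is computed, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's algorithm. It assumes that a term counts as rare when its document frequency falls below a hypothetical threshold `min_df`, and that a rare term is expressed by its normalised co-occurrence counts with the common terms. The function name `build_projection` and the tiny corpus are illustrative only.

```python
import numpy as np

def build_projection(corpus, min_df):
    """Build a (n_terms x n_common) projection matrix.

    corpus : documents x terms matrix of term weights.
    min_df : hypothetical document-frequency threshold; terms that occur
             in fewer documents are treated as rare (an assumption).
    """
    corpus = np.asarray(corpus, dtype=float)
    n_terms = corpus.shape[1]

    # Document frequency: number of documents containing each term.
    df = np.count_nonzero(corpus, axis=0)
    common = np.flatnonzero(df >= min_df)
    rare = np.flatnonzero(df < min_df)

    projection = np.zeros((n_terms, common.size))

    # Each common term maps onto its own feature in the reduced space.
    projection[common, np.arange(common.size)] = 1.0

    # Assumption: a rare term is described by how often it co-occurs with
    # each common term, scaled to unit Euclidean length.
    if rare.size:
        cooc = corpus[:, rare].T @ corpus[:, common]
        norms = np.linalg.norm(cooc, axis=1, keepdims=True)
        norms[norms == 0] = 1.0  # rare terms with no common co-occurrence stay zero
        projection[rare] = cooc / norms

    return projection, common

# Toy corpus: 3 documents, 4 terms; only the last term is rare for min_df=2.
corpus = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
])
projection, common = build_projection(corpus, min_df=2)

# The same matrix reduces the whole corpus or a single query vector.
reduced_corpus = corpus @ projection
reduced_query = np.array([0, 0, 0, 1]) @ projection
```

In this sketch a query that contains only the rare term is mapped onto the common terms it co-occurs with, which is the behaviour the abstract describes for applying the projection matrix to document and query vectors.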