Parallel rare term vector replacement: Fast and effective dimensionality reduction for text

Authors:
T. Berka; M. Vajteršic
Affiliations:
Department of Computer Sciences, University of Salzburg, Salzburg, Austria;Department of Computer Sciences, University of Salzburg, Salzburg, Austria and Department of Informatics, Mathematical Institute, Slovak Academy of Sciences, Bratislava, Slovak Republic
Venue:
Journal of Parallel and Distributed Computing
Year:
2013

Abstract

Dimensionality reduction is an established area in text mining and information retrieval. Dimensionality reduction methods convert highly sparse corpus matrices into a dense matrix format while preserving or improving classification accuracy or retrieval performance. In this paper, we describe a novel approach to dimensionality reduction for text, along with a parallel algorithm suitable for private-memory parallel computer systems. According to Zipf's law, the majority of indexing terms occur in only a small number of documents. Our algorithm replaces each rare term with a vector that expresses its semantics in terms of common terms. This process produces a projection matrix, which can be applied to a corpus matrix and to individual document and query vectors. We give an accurate mathematical and algorithmic description of our algorithms and present an experimental evaluation on two benchmark corpora. These experiments indicate that our algorithm can deliver a substantial reduction in the number of features, from 47,236 to 392 on the Reuters corpus, with a clear improvement in retrieval performance. We have evaluated our parallel implementation using the Message Passing Interface (MPI) with up to 32 processes on a Nehalem Xeon cluster, computing the projection matrix for the dimensionality reduction of over 800,000 documents in just under 100 s.
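
The abstract does not state the formula for the replacement vectors, so the following is only a minimal serial sketch of the general idea, under stated assumptions rather than the authors' algorithm. It assumes a dense term-by-document matrix, a hypothetical document-frequency threshold `min_df` that separates rare from common terms, and a replacement vector formed as the normalized, weighted sum of the common-term parts of the documents containing the rare term. The function name `build_projection` is likewise hypothetical.

```python
import numpy as np

def build_projection(C, min_df):
    """Sketch: build a projection matrix that replaces rare terms.

    C      : terms x documents matrix (dense here, for brevity).
    min_df : assumed threshold; terms in fewer documents count as rare.
    Returns (R, common), where R has shape (num_common_terms, num_terms).
    """
    C = np.asarray(C, dtype=float)
    df = np.count_nonzero(C, axis=1)       # document frequency of each term
    common = np.flatnonzero(df >= min_df)  # indices of retained (common) terms
    rare = np.flatnonzero(df < min_df)     # indices of terms to be replaced

    k, m = common.size, C.shape[0]
    R = np.zeros((k, m))
    R[np.arange(k), common] = 1.0          # common terms map to themselves

    for t in rare:
        docs = np.flatnonzero(C[t])        # documents containing rare term t
        if docs.size == 0:
            continue                       # unused term: column stays zero
        weights = C[t, docs]
        # Assumed rule: weighted sum of the documents' common-term parts.
        v = C[np.ix_(common, docs)] @ weights
        norm = np.linalg.norm(v)
        if norm > 0:
            R[:, t] = v / norm
    return R, common
```

With such a matrix, a reduced corpus would be `R @ C`, and a single document or query vector `d` would be mapped by `R @ d`, mirroring the abstract's statement that the projection matrix applies to corpus matrices and to individual vectors alike. A parallel version along the lines described in the abstract would presumably partition the rare terms or the documents across MPI processes, but that partitioning is not specified here.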