Parallel rare term vector replacement: Fast and effective dimensionality reduction for text

  • Authors:
  • T. Berka; M. Vajteršic

  • Affiliations:
  • Department of Computer Sciences, University of Salzburg, Salzburg, Austria; Department of Computer Sciences, University of Salzburg, Salzburg, Austria and Department of Informatics, Mathematical Institute, Slovak Academy of Sciences, Bratislava, Slovak Republic

  • Venue:
  • Journal of Parallel and Distributed Computing
  • Year:
  • 2013

Abstract

Dimensionality reduction is an established area in text mining and information retrieval. Its methods convert highly sparse corpus matrices into a dense matrix format while preserving or improving classification accuracy or retrieval performance. In this paper, we describe a novel approach to dimensionality reduction for text, along with a parallel algorithm suitable for private-memory parallel computer systems. According to Zipf's law, the majority of indexing terms occur in only a small number of documents. Our algorithm replaces each rare term with a vector that expresses its semantics in terms of common terms. This process produces a projection matrix, which can be applied to a corpus matrix and to individual document and query vectors. We give an accurate mathematical and algorithmic description of our algorithms and present an experimental evaluation on two benchmark corpora. These experiments indicate that our algorithm can deliver a substantial reduction in the number of features, from 47,236 to 392 features on the Reuters corpus, with a clear improvement in retrieval performance. We have evaluated our parallel implementation using the Message Passing Interface (MPI) with up to 32 processes on a Nehalem Xeon cluster, computing the projection matrix for the dimensionality reduction of over 800,000 documents in just under 100 s.
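
The idea of replacing rare terms and collecting the result in a projection matrix can be illustrated with a minimal serial sketch. The abstract does not give the exact construction, so everything below is an assumption made for illustration: the function names `build_projection` and `project`, the document-frequency threshold `min_doc_freq`, and the choice of a weighted average of the common-term parts of the documents containing a rare term as its replacement vector. It is not the authors' parallel implementation.

```python
def build_projection(corpus, min_doc_freq):
    """Build a (common terms x all terms) projection matrix.

    corpus: list of document vectors (lists of non-negative term weights).
    min_doc_freq: hypothetical threshold; terms occurring in fewer
    documents than this are treated as rare.
    """
    n_terms = len(corpus[0])
    doc_freq = [sum(1 for doc in corpus if doc[t] != 0) for t in range(n_terms)]
    common = [t for t in range(n_terms) if doc_freq[t] >= min_doc_freq]
    row_of = {t: i for i, t in enumerate(common)}

    projection = [[0.0] * n_terms for _ in common]
    for t in range(n_terms):
        if t in row_of:
            # A common term is kept as its own feature.
            projection[row_of[t]][t] = 1.0
            continue
        total = sum(doc[t] for doc in corpus)
        if total == 0:
            continue  # the term never occurs; its column stays zero
        # Assumed replacement vector: average of the common-term parts of
        # the documents containing the rare term, weighted by its weight.
        for doc in corpus:
            if doc[t] == 0:
                continue
            for i, c in enumerate(common):
                projection[i][t] += doc[t] * doc[c] / total
    return common, projection


def project(projection, vector):
    """Map a document or query vector into the reduced feature space."""
    return [sum(p * v for p, v in zip(row, vector)) for row in projection]


# Toy corpus: 3 documents over 3 terms; term 2 occurs in one document only.
corpus = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 2],
]
common, projection = build_projection(corpus, min_doc_freq=2)
reduced = project(projection, corpus[2])
```

In this toy corpus, terms 0 and 1 are kept as features, and the weight of rare term 2 is folded into term 1, the only common term it co-occurs with. The same `projection` can be applied to every document in the corpus and to query vectors.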