Text classification on a grid environment

  • Authors:
  • Valeriana G. Roncero;Myrian C. A. Costa;Nelson F. F. Ebecken

  • Affiliations:
  • COPPE, Federal University of Rio de Janeiro, Centro de Tecnologia, Rio de Janeiro, RJ, Brazil;COPPE, Federal University of Rio de Janeiro, Centro de Tecnologia, Rio de Janeiro, RJ, Brazil;COPPE, Federal University of Rio de Janeiro, Centro de Tecnologia, Rio de Janeiro, RJ, Brazil

  • Venue:
  • VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
  • Year:
  • 2010

Abstract

The enormous amount of information stored in unstructured texts cannot be used directly for further processing by computers, which typically handle text as simple sequences of characters. Text mining is the process of extracting interesting information and knowledge from unstructured text. One key difficulty with text classification learning algorithms is that they require many hand-labeled documents to learn accurately. In the pattern discovery phase of text mining, the text classification step aims to automatically assign one or more predefined classes to text documents. In this research, we propose an algorithm for learning from labeled and unlabeled documents that combines Expectation-Maximization (EM) with a naïve Bayes classifier on a grid environment. This combination is based on a mixture of multinomials, which is commonly used in text classification. Naïve Bayes is a probabilistic approach to inductive learning. It estimates the a posteriori probability that a document belongs to a class given the observed feature values of the document, assuming that the features are independent. The class with the maximum a posteriori probability is assigned to the document. Expectation-Maximization (EM) is a class of iterative algorithms for maximum likelihood or maximum a posteriori estimation in problems with unlabeled data. The grid environment is a geographically distributed computation infrastructure composed of a set of heterogeneous resources. The semi-supervised learning classifier is made available as a grid service, expanding the functionality of the Aîuri Portal, a framework for a cooperative academic environment for education and research. Text classification methods are time-consuming, so using the grid infrastructure can bring significant benefits to the learning and classification processes.
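The combination of EM and naïve Bayes described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes documents are already converted to word-count vectors, uses Laplace smoothing with a hypothetical parameter `alpha`, runs a fixed number of iterations, and runs on a single machine with no grid component. All function names (`m_step`, `e_step`, `em_naive_bayes`, `predict`) are made up for illustration.

```python
import numpy as np

def m_step(X, R, alpha=1.0):
    """Estimate log priors and log word probabilities from soft labels R.

    X: (docs x vocab) word counts; R: (docs x classes) class responsibilities.
    alpha is an assumed Laplace smoothing constant.
    """
    n_docs, n_classes = R.shape
    log_prior = np.log((R.sum(axis=0) + alpha) / (n_docs + alpha * n_classes))
    counts = R.T @ X + alpha  # (classes x vocab) expected word counts, smoothed
    log_word = np.log(counts / counts.sum(axis=1, keepdims=True))
    return log_prior, log_word

def e_step(X, log_prior, log_word):
    """Posterior P(class | document) under the multinomial naive Bayes model."""
    log_joint = X @ log_word.T + log_prior
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)  # stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, n_iter=10):
    """Train on labeled and unlabeled documents with EM."""
    R_lab = np.eye(n_classes)[y_lab]            # labeled docs keep one-hot labels
    log_prior, log_word = m_step(X_lab, R_lab)  # initialise from labeled data only
    X_all = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        R_unl = e_step(X_unl, log_prior, log_word)           # E-step
        R_all = np.vstack([R_lab, R_unl])
        log_prior, log_word = m_step(X_all, R_all)           # M-step
    return log_prior, log_word

def predict(X, log_prior, log_word):
    """Assign each document the class with maximum a posteriori probability."""
    return e_step(X, log_prior, log_word).argmax(axis=1)

# Toy data: words 0-1 are typical of class 0, words 2-3 of class 1.
X_lab = np.array([[3, 1, 0, 0], [0, 0, 2, 2]], dtype=float)
y_lab = np.array([0, 1])
X_unl = np.array([[2, 2, 0, 0], [0, 1, 3, 1], [4, 0, 1, 0], [0, 0, 1, 3]],
                 dtype=float)
log_prior, log_word = em_naive_bayes(X_lab, y_lab, X_unl, n_classes=2)
labels = predict(np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float),
                 log_prior, log_word)
```

In this sketch the E-step treats each document independently, which is the kind of work that could in principle be split across distributed resources; how the paper's grid service actually partitions the computation is not stated in the abstract.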