Text classification on a grid environment

Authors:
Valeriana G. Roncero;Myrian C. A. Costa;Nelson F. F. Ebecken
Affiliations:
COPPE, Federal University of Rio de Janeiro, Centro de Tecnologia, Rio de Janeiro, RJ, Brazil;COPPE, Federal University of Rio de Janeiro, Centro de Tecnologia, Rio de Janeiro, RJ, Brazil;COPPE, Federal University of Rio de Janeiro, Centro de Tecnologia, Rio de Janeiro, RJ, Brazil
Venue:
VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
Year:
2010

Citing 14
Cited 0

Text Classification from Labeled and Unlabeled Documents using EM

Machine Learning - Special issue on information retrieval
Data mining: concepts and techniques

Data mining: concepts and techniques
An Evaluation of Statistical Approaches to Text Categorization

Information Retrieval
Machine Learning

Machine Learning
Modern Information Retrieval

Modern Information Retrieval
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
A Comparative Study on Feature Selection in Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Pattern Classification (2nd Edition)

Pattern Classification (2nd Edition)
The Anatomy of the Grid: Enabling Scalable Virtual Organizations

International Journal of High Performance Computing Applications
Natural Language Processing and Text Mining

Natural Language Processing and Text Mining
HAND: Highly Available Dynamic Deployment Infrastructure for Globus Toolkit 4

PDP '07 Proceedings of the 15th Euromicro International Conference on Parallel, Distributed and Network-Based Processing
Text Mining Grid Services for Multiple Environments

High Performance Computing for Computational Science - VECPAR 2008

Quantified Score

Hi-index	0.00

Visualization

Abstract

The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Text mining is the process of extracting interesting information and knowledge from unstructured text. One key difficulty with text classification learning algorithms is that they require many hand-labeled documents to learn accurately. In the text mining pattern discovery phase, the text classification step aims at automatically attribute one or more predefined classes to text documents. In this research, we propose to use an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naïve Bayes classifier on a grid environment, this combination is based on a mixture of multinomials, which is commonly used in text classification. Naïve Bayes is a probabilistic approach to inductive learning. It estimates the a posteriori probability that a document belongs to a class given the observed feature values of the documents, assuming independence of the features. The class with the maximum a posteriori probability is assigned to the document. Expectation-Maximization (EM) is a class of iterative algorithms for maximum likelihood or maximum a posteriori estimation in problems with unlabeled data. The grid environment is a geographically distributed computation infrastructure composed of a set of heterogeneous resources. The semi-supervised learning classifier in the grid is available as a grid service, expanding the functionality of Aîuri Portal, which is a framework for a cooperative academic environment for education and research. Text classification mining methods are time-consuming by using the grid infrastructure can bring significant benefits in learning and the classification process.