A parallel learning algorithm for text classification

Authors:
Canasai Kruengkrai; Chuleerat Jaruskulchai
Affiliations:
Kasetsart University, Bangkok, Thailand; Kasetsart University, Bangkok, Thailand
Venue:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2002

Abstract

Text classification is the process of assigning documents to predefined categories based on their content. Existing supervised learning algorithms for automatic text classification need a sufficient number of labeled documents to learn accurately. Applying the Expectation-Maximization (EM) algorithm to this problem is an alternative approach that uses a large pool of unlabeled documents to augment the available labeled documents. Unfortunately, learning from such a large pool of unlabeled documents takes too much time. This paper introduces a novel parallel learning algorithm for the text classification task. The parallel algorithm is based on a combination of the EM algorithm and the naive Bayes classifier. Our goal is to reduce the computation time of the learning and classification processes. We studied the performance of our parallel algorithm on a large Linux PC cluster called PIRUN Cluster. We report both timing and accuracy results. These results indicate that the proposed parallel algorithm is capable of handling large document collections.
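
The combination of EM and naive Bayes described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the algorithm's details, so everything here is an assumption made for illustration, including the function names (`partial_counts`, `merge`, `normalize`, `posterior`, `em_naive_bayes`), the bag-of-words encoding as `{word_id: count}` dictionaries, Laplace smoothing, and the data-parallel split of the unlabeled pool. Each partition produces soft counts independently, and the partial counts are summed before the parameters are re-estimated. The partitions are processed sequentially here; on a cluster, each one would presumably run on its own node with the sum done by a reduction step.

```python
import math

def partial_counts(docs, resp, n_classes, vocab_size):
    """Soft counts from one partition: class totals and per-class word totals."""
    class_tot = [0.0] * n_classes
    word_tot = [[0.0] * vocab_size for _ in range(n_classes)]
    for doc, r in zip(docs, resp):
        for c in range(n_classes):
            class_tot[c] += r[c]
            for w, n in doc.items():
                word_tot[c][w] += r[c] * n
    return class_tot, word_tot

def merge(a, b):
    """Element-wise sum of two partial counts (stands in for a reduce step)."""
    return ([x + y for x, y in zip(a[0], b[0])],
            [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[1], b[1])])

def normalize(counts, n_docs):
    """M-step: Laplace-smoothed log priors and log word probabilities."""
    class_tot, word_tot = counts
    k, v = len(class_tot), len(word_tot[0])
    log_prior = [math.log((1 + t) / (k + n_docs)) for t in class_tot]
    log_cond = []
    for row in word_tot:
        denom = v + sum(row)
        log_cond.append([math.log((1 + n) / denom) for n in row])
    return log_prior, log_cond

def posterior(doc, model):
    """E-step for one document: P(class | doc) under the current model."""
    log_prior, log_cond = model
    scores = [lp + sum(n * lc[w] for w, n in doc.items())
              for lp, lc in zip(log_prior, log_cond)]
    top = max(scores)  # subtract the max before exponentiating, for stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def em_naive_bayes(labeled, labels, unlabeled_parts, n_classes, vocab_size, iters=10):
    """Train naive Bayes on labeled docs, then refine with EM over unlabeled partitions."""
    one_hot = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    labeled_counts = partial_counts(labeled, one_hot, n_classes, vocab_size)
    model = normalize(labeled_counts, len(labeled))
    n_total = len(labeled) + sum(len(p) for p in unlabeled_parts)
    for _ in range(iters):
        total = labeled_counts
        for part in unlabeled_parts:  # each partition is independent work
            resp = [posterior(d, model) for d in part]
            total = merge(total, partial_counts(part, resp, n_classes, vocab_size))
        model = normalize(total, n_total)
    return model
```

Because the M-step depends only on summed counts, splitting the unlabeled pool into more partitions should leave the learned model unchanged up to floating-point rounding; only the sufficient statistics, not the documents themselves, need to be exchanged between workers.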