Iterative cross-training: An algorithm for web page categorization

Authors:
Nuanwan Soonthornphisaj;Boonserm Kijsirikul
Affiliations:
(Correspd.) Mach. Intell. and Knowl. Disc. Lab., Dept. of Comp. Eng., Fac. of Eng., Chulalongkorn Univ., Bangkok, Thailand, 10330. Tel. +661 6592877/ Fax. +662 2186955/ Nuanwan@chula.com;Machine Intelligence and Knowledge Discovery Laboratory, Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand, 10330. Tel.: +662 2186956/ Fax: +6 ...
Venue:
Intelligent Data Analysis
Year:
2003

Abstract

The goal of Web page categorization is to classify Web documents into a fixed number of predefined categories. Previous work in this area has relied on a large number of labeled training documents for supervised learning. The problem is that labeled training documents are difficult to create: manually categorizing documents is hard, whereas unlabeled documents are easy to collect. We therefore investigate a new machine learning algorithm that overcomes this difficulty by making effective use of unlabeled documents. We propose a novel approach called Iterative Cross-Training (ICT) and apply it to Web page categorization on three data sets. The performance of ICT is evaluated and compared with that of supervised learning algorithms, Co-Training, and Expectation Maximization. We find that ICT is an effective approach for the Web page categorization task.