Iterative cross-training: An algorithm for learning from unlabeled Web pages

Authors:
Nuanwan Soonthornphisaj; Boonserm Kijsirikul
Affiliations:
Machine Intelligence and Knowledge Discovery Laboratory, Department of Computer Engineering, Chulalongkorn University, Bangkok 10330, Thailand; Machine Intelligence and Knowledge Discovery Laboratory, Department of Computer Engineering, Chulalongkorn University, Bangkok 10330, Thailand
Venue:
International Journal of Intelligent Systems - Intelligent Technologies
Year:
2004

Abstract

This article presents a new learning method, iterative cross-training (ICT), for classifying Web pages, and evaluates it on three classification problems: (1) Thai/non-Thai Web pages, (2) course/non-course home pages, and (3) university-related Web pages. Given domain knowledge or a small set of labeled data, our method combines two classifiers that make effective use of unlabeled examples by iteratively training each other. We compare ICT against three other learning methods: a supervised word segmentation classifier, a supervised naïve Bayes classifier, and a co-training-style classifier. The experimental results on the three problems show that ICT outperforms the other classifiers. One advantage of ICT is that it needs only a small set of prelabeled data, or none at all when domain knowledge is available. © 2004 Wiley Periodicals, Inc.
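
The abstract does not spell out the training procedure, so the following Python sketch is only one plausible reading of "two classifiers that iteratively train each other", not the authors' implementation. It assumes that both classifiers are small multinomial naïve Bayes models, that the unlabeled pages are split into two pools, that the first classifier is seeded from a few labeled pages, and that iteration stops when the labels assigned to the second pool no longer change. The names `NaiveBayes`, `iterative_cross_training`, `pool_a`, and `pool_b` are hypothetical, and the toy documents are made up for illustration.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial naive Bayes over whitespace tokens (add-one smoothing)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.n_docs = len(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc.split())
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def _score(self, doc, c):
        # Log prior plus smoothed log likelihood of every known word.
        total = sum(self.counts[c].values()) + len(self.vocab)
        score = math.log(self.prior[c] / self.n_docs)
        for word in doc.split():
            if word in self.vocab:  # words never seen in training are skipped
                score += math.log((self.counts[c][word] + 1) / total)
        return score

    def predict(self, docs):
        return [max(self.classes, key=lambda c: self._score(d, c)) for d in docs]


def iterative_cross_training(seed_docs, seed_labels, pool_a, pool_b, rounds=10):
    """Assumed scheme: A labels pool_b for B, then B labels pool_a for A."""
    clf_a = NaiveBayes().fit(seed_docs, seed_labels)  # seeded from labeled data
    clf_b = NaiveBayes()
    previous = None
    for _ in range(rounds):
        labels_b = clf_a.predict(pool_b)
        if labels_b == previous:  # assumed stopping rule: labels are stable
            break
        clf_b.fit(pool_b, labels_b)
        labels_a = clf_b.predict(pool_a)
        clf_a.fit(seed_docs + pool_a, seed_labels + labels_a)
        previous = labels_b
    return clf_a, clf_b


# Toy data (invented): two labeled seed pages and eight unlabeled pages.
seed_docs = ["course syllabus homework", "research faculty lab"]
seed_labels = ["course", "non-course"]
pool_a = ["syllabus homework exam", "faculty research grant",
          "homework exam lecture", "lab grant publications"]
pool_b = ["course exam lecture", "research publications lab",
          "syllabus lecture", "faculty grant"]

clf_a, clf_b = iterative_cross_training(seed_docs, seed_labels, pool_a, pool_b)
print(clf_a.predict(["exam lecture", "grant publications"]))
```

In this toy run, words such as "exam" and "grant" never occur in the labeled seed pages, so whatever the classifiers learn about them comes from the unlabeled pools. A real system in the spirit of the article could replace the seed step with a classifier built from domain knowledge.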