Most document classification systems consider only the distribution of content words in documents, ignoring the underlying syntactic information even though it is also an important factor. In this paper, we present an approach to classifying large-scale unstructured documents that incorporates both lexical and syntactic information. For this purpose, we use the co-training algorithm, a partially supervised learning algorithm in which two separate views of the training data are employed and a small set of labeled data is augmented with a large number of unlabeled data. Since lexical and syntactic information can serve as the two separate views of unstructured documents, the co-training algorithm improves document classification performance by exploiting both views together with a large number of unlabeled documents. Experimental results on the Reuters-21578 corpus and TREC-7 filtering documents show the effectiveness of unlabeled documents and of using both lexical and syntactic information.
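The co-training loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two "views" are small synthetic numeric feature vectors standing in for lexical and syntactic representations, the base learner is a simple nearest-centroid classifier, and all names (`co_train`, `grow`, etc.) are assumptions made for the example.

```python
def centroid_classifier(X, y):
    """Train a nearest-centroid classifier; returns predict(x) -> (label, confidence)."""
    by_label = {}
    for xi, yi in zip(X, y):
        by_label.setdefault(yi, []).append(xi)
    centroids = {lab: [sum(col) / len(col) for col in zip(*rows)]
                 for lab, rows in by_label.items()}

    def predict(x):
        dists = {lab: sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5
                 for lab, c in centroids.items()}
        lab = min(dists, key=dists.get)
        ordered = sorted(dists.values())
        # Confidence = margin between the two nearest centroids.
        conf = ordered[1] - ordered[0] if len(ordered) > 1 else 1.0
        return lab, conf

    return predict

def co_train(view1, view2, labels, labeled_idx, unlabeled_idx, rounds=5, grow=2):
    """Co-training sketch: each round, each view's classifier moves its most
    confident unlabeled examples (with their predicted labels) into the
    shared labeled pool, so each view teaches the other."""
    labeled = {i: labels[i] for i in labeled_idx}   # only these labels are known
    unlabeled = set(unlabeled_idx)
    for _ in range(rounds):
        for view in (view1, view2):
            X = [view[i] for i in labeled]
            y = [labeled[i] for i in labeled]
            predict = centroid_classifier(X, y)
            # Rank the unlabeled pool by this view's confidence.
            scored = sorted(((predict(view[i]), i) for i in unlabeled),
                            key=lambda t: -t[0][1])
            for (lab, _), i in scored[:grow]:
                labeled[i] = lab          # self-labeled example joins the pool
                unlabeled.discard(i)
        if not unlabeled:
            break
    return labeled
```

For example, with two well-separated classes and only one labeled example per class, `co_train` propagates labels to the whole unlabeled pool in a single round. In the paper's setting the views would instead be lexical and syntactic feature vectors and the base learner a stronger classifier, but the pool-growing structure is the same.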