Co-trained support vector machines for large scale unstructured document classification using unlabeled data and syntactic information

Authors:
Seong-Bae Park;Byoung-Tak Zhang
Affiliations:
School of Computer Science and Engineering, Seoul National University, 151-744 Seoul, South Korea;School of Computer Science and Engineering, Seoul National University, 151-744 Seoul, South Korea
Venue:
Information Processing and Management: an International Journal
Year:
2004

Abstract

Most document classification systems consider only the distribution of content words in documents and ignore the underlying syntactic information, even though it is also an important factor. In this paper, we present an approach for classifying large-scale unstructured documents that incorporates both the lexical and the syntactic information of documents. For this purpose, we use the co-training algorithm, a partially supervised learning algorithm in which two separate views of the training data are employed and a small number of labeled examples is augmented by a large number of unlabeled ones. Since the lexical and the syntactic information can serve as separate views of unstructured documents, the co-training algorithm improves document classification performance by using both of them together with a large number of unlabeled documents. Experimental results on the Reuters-21578 corpus and the TREC-7 filtering documents show the effectiveness of unlabeled documents and of using both the lexical and the syntactic information.
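
The co-training loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a toy centroid classifier stands in for the support vector machines, and the feature names (words for the lexical view, chunk-tag patterns such as "NP-VP" for the syntactic view), the function names, and the data are all hypothetical. In each round, one classifier per view is trained on the labeled set, and each then moves its most confidently scored unlabeled document into the labeled set.

```python
from collections import Counter


class CentroidClassifier:
    """Toy binary classifier (labels +1 / -1); a stand-in for an SVM, by assumption."""

    def fit(self, X, y):
        self.pos, self.neg = Counter(), Counter()
        n_pos = sum(1 for label in y if label == 1) or 1
        n_neg = sum(1 for label in y if label == -1) or 1
        for features, label in zip(X, y):
            target, n = (self.pos, n_pos) if label == 1 else (self.neg, n_neg)
            for name, value in features.items():
                target[name] += value / n  # accumulate the class mean
        return self

    def score(self, features):
        # Positive score leans towards +1, negative towards -1, zero is undecided.
        return sum(v * (self.pos[f] - self.neg[f]) for f, v in features.items())


def train_views(labeled):
    """Train one classifier per view on the current labeled set."""
    labels = [label for _, label in labeled]
    return tuple(
        CentroidClassifier().fit([views[i] for views, _ in labeled], labels)
        for i in (0, 1)
    )


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """labeled: [((view_a, view_b), label)]; unlabeled: [(view_a, view_b)]."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        # Both classifiers are trained before either one labels new documents.
        for view_index, clf in enumerate(train_views(labeled)):
            ranked = sorted(
                pool, key=lambda v: abs(clf.score(v[view_index])), reverse=True
            )
            for views in ranked[:per_round]:
                s = clf.score(views[view_index])
                if s == 0:
                    continue  # this view has no evidence; leave the document unlabeled
                labeled.append((views, 1 if s > 0 else -1))
                pool.remove(views)
    return train_views(labeled), labeled


def predict(classifiers, views):
    """Combine the two views by summing their scores."""
    total = sum(clf.score(view) for clf, view in zip(classifiers, views))
    return 1 if total > 0 else -1


# Hypothetical toy data: (lexical view, syntactic view).
labeled_docs = [
    (({"stock": 1, "market": 1}, {"NP-VP": 1}), 1),
    (({"match": 1, "goal": 1}, {"VP-NP": 1}), -1),
]
unlabeled_docs = [
    ({"stock": 1, "shares": 1}, {"NP-PP": 1}),
    ({"goal": 1, "team": 1}, {"VP-PP": 1}),
    ({"shares": 1, "bond": 1}, {"NP-PP": 1}),
    ({"team": 1, "coach": 1}, {"VP-PP": 1}),
]

classifiers, final_labeled = co_train(labeled_docs, unlabeled_docs)
print(len(final_labeled))
print(predict(classifiers, ({"bond": 1}, {"NP-PP": 1})))
```

In this toy run, the third unlabeled document shares no word with the initial labeled set, so the lexical view alone cannot label it at first; it is picked up through the syntactic view once an earlier self-labeled document has introduced the "NP-PP" pattern. That is the kind of mutual reinforcement between views that co-training relies on, shown here only at illustrative scale.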