Automatic classification of documents is an important area of research with many applications in document searching, forensics, and other fields. Text classification methods rely on a sample of documents whose class labels are known. In many situations, however, obtaining this sample is difficult or even impossible. In this paper we focus on classifying unlabelled documents into two classes, relevant and irrelevant, given a topic of interest. By dividing the set of documents into buckets (for instance, answers returned by different search engines) and using association rule mining to find common sets of words among the buckets, we can efficiently obtain a sample of documents that has a large percentage of relevant ones. This sample can be used to train models to classify the entire set of documents. We show experimentally that our method is capable of filtering relevant documents even in adverse conditions where the percentage of irrelevant documents in the buckets is relatively high.
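The bucket idea described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm; it is a minimal illustration under stated assumptions: documents are whitespace-tokenized strings, a bucket "supports" a word set when at least one of its documents contains every word in the set, and the function names (`frequent_word_sets`, `select_sample`), the `min_support` threshold, and the toy data are all hypothetical.

```python
from itertools import combinations


def tokenize(doc):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return set(doc.lower().split())


def frequent_word_sets(buckets, min_support, size=2):
    """Return word sets of the given size supported by at least min_support buckets.

    Assumption: a bucket supports a word set if some document in the bucket
    contains all of its words.
    """
    tokenized = [[tokenize(doc) for doc in bucket] for bucket in buckets]

    # Apriori-style pruning: a word set can only be frequent if each of its
    # words appears in at least min_support buckets.
    word_counts = {}
    for bucket in tokenized:
        vocabulary = set().union(*bucket) if bucket else set()
        for word in vocabulary:
            word_counts[word] = word_counts.get(word, 0) + 1
    frequent_words = sorted(w for w, c in word_counts.items() if c >= min_support)

    result = []
    for candidate in combinations(frequent_words, size):
        candidate_set = frozenset(candidate)
        support = sum(
            1 for bucket in tokenized
            if any(candidate_set <= words for words in bucket)
        )
        if support >= min_support:
            result.append(candidate_set)
    return result


def select_sample(buckets, word_sets):
    """Keep the documents that contain at least one frequent word set."""
    sample = []
    for bucket in buckets:
        for doc in bucket:
            words = tokenize(doc)
            if any(ws <= words for ws in word_sets):
                sample.append(doc)
    return sample


# Toy data: three hypothetical buckets, e.g. results from three search engines.
buckets = [
    ["jaguar car engine review", "jaguar cat habitat"],
    ["jaguar car dealer price", "football team jaguar"],
    ["new jaguar car model", "cooking pasta recipe"],
]

word_sets = frequent_word_sets(buckets, min_support=3, size=2)
sample = select_sample(buckets, word_sets)
print(word_sets)
print(sample)
```

In this toy run, the only word pair shared by all three buckets is {"car", "jaguar"}, so the sample keeps the three car-related documents and drops the rest. The resulting sample could then serve as likely-relevant training data for a classifier, as the abstract suggests.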