Mining Relevant Text from Unlabelled Documents

Authors:
Daniel Barbará;Carlotta Domeniconi;Ning Kang
Venue:
ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Year:
2003

Abstract

Automatic classification of documents is an important area of research with many applications in the fields of document searching, forensics and others. Methods to perform classification of text rely on the existence of a sample of documents whose class labels are known. However, in many situations, obtaining this sample may not be an easy (or even possible) task. In this paper we focus on the classification of unlabelled documents into two classes: relevant and irrelevant, given a topic of interest. By dividing the set of documents into buckets (for instance, answers returned by different search engines), and using association rule mining to find common sets of words among the buckets, we can efficiently obtain a sample of documents that has a large percentage of relevant ones. This sample can be used to train models to classify the entire set of documents. We prove, via experimentation, that our method is capable of filtering relevant documents even in adverse conditions where the percentage of irrelevant documents in the buckets is relatively high.
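
The bucket idea in the abstract can be illustrated with a small sketch. This is only one possible reading of the abstract and should not be taken as the authors' algorithm: it assumes whitespace tokenization, replaces a real association rule miner with a brute-force count of word sets up to size two, and treats a word set as "common" only if it is frequent in every bucket. The function names (`frequent_word_sets`, `select_sample`), the parameters (`min_support`, `max_size`, `min_set_size`) and the toy documents are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def tokenize(doc):
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    return frozenset(doc.lower().split())


def frequent_word_sets(docs, min_support, max_size=2):
    """Word sets (size 1..max_size) contained in at least a min_support
    fraction of docs. Brute-force stand-in for a frequent-itemset miner."""
    token_sets = [tokenize(d) for d in docs]
    threshold = min_support * len(token_sets)
    counts = Counter()
    for words in token_sets:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(words), size):
                counts[frozenset(combo)] += 1
    return {s for s, c in counts.items() if c >= threshold}


def select_sample(buckets, min_support=0.5, max_size=2, min_set_size=2):
    """Return the word sets frequent in every bucket and the documents
    that contain at least one of them (the candidate 'relevant' sample)."""
    per_bucket = [frequent_word_sets(b, min_support, max_size) for b in buckets]
    common = {s for s in set.intersection(*per_bucket) if len(s) >= min_set_size}
    sample = [
        doc
        for bucket in buckets
        for doc in bucket
        if any(s <= tokenize(doc) for s in common)
    ]
    return common, sample


# Toy buckets, e.g. results for the same topic from two search engines.
buckets = [
    ["jaguar car engine review", "jaguar car dealer prices", "jaguar cat habitat jungle"],
    ["new jaguar car model", "jaguar car engine specs", "football team jaguar scores"],
]
common, sample = select_sample(buckets, min_support=0.6)
print(common)
print(sample)
```

In this toy run the only multi-word set shared by both buckets is {jaguar, car}, so the two off-topic documents are left out of the sample. Following the abstract, such a sample would then serve as training data for a classifier applied to the full document set; that step is not shown here.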