Mining Relevant Text from Unlabelled Documents

  • Authors:
  • Daniel Barbará;Carlotta Domeniconi;Ning Kang

  • Venue:
  • ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
  • Year:
  • 2003

Abstract

Automatic classification of documents is an important area of research with many applications in the fields of document searching, forensics and others. Methods to perform classification of text rely on the existence of a sample of documents whose class labels are known. However, in many situations, obtaining this sample may not be an easy (or even possible) task. In this paper we focus on the classification of unlabelled documents into two classes: relevant and irrelevant, given a topic of interest. By dividing the set of documents into buckets (for instance, answers returned by different search engines), and using association rule mining to find common sets of words among the buckets, we can efficiently obtain a sample of documents that has a large percentage of relevant ones. This sample can be used to train models to classify the entire set of documents. We prove, via experimentation, that our method is capable of filtering relevant documents even in adverse conditions where the percentage of irrelevant documents in the buckets is relatively high.
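
The bucket idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract gives no thresholds, itemset sizes, or tokenization details, so everything below is an assumption made for illustration. The function names (`tokenize`, `bucket_itemsets`, `common_word_sets`, `select_sample`), the parameters `size`, `min_doc_frac`, and `min_buckets`, and the toy documents are all hypothetical. The sketch treats each document as a set of words, finds word sets that are frequent inside a bucket, keeps those that recur across buckets, and selects the documents containing at least one such set as the candidate "mostly relevant" sample.

```python
from collections import Counter
from itertools import combinations


def tokenize(doc):
    """Assumed tokenization: lowercase, whitespace split, drop words under 3 chars."""
    return frozenset(w for w in doc.lower().split() if len(w) > 2)


def bucket_itemsets(bucket, size, min_doc_frac):
    """Word sets of the given size found in at least min_doc_frac of a bucket's documents."""
    counts = Counter()
    for doc in bucket:
        # Sorting makes each word set a canonical tuple, so counts line up.
        for itemset in combinations(sorted(tokenize(doc)), size):
            counts[itemset] += 1
    threshold = min_doc_frac * len(bucket)
    return {s for s, c in counts.items() if c >= threshold}


def common_word_sets(buckets, size=2, min_doc_frac=0.5, min_buckets=None):
    """Word sets that are frequent in at least min_buckets buckets (default: all of them)."""
    if min_buckets is None:
        min_buckets = len(buckets)
    support = Counter()
    for bucket in buckets:
        # Each bucket contributes at most one vote per word set.
        support.update(bucket_itemsets(bucket, size, min_doc_frac))
    return {s for s, c in support.items() if c >= min_buckets}


def select_sample(buckets, word_sets):
    """Documents, in bucket order, that contain every word of at least one common word set."""
    sample = []
    for bucket in buckets:
        for doc in bucket:
            words = tokenize(doc)
            if any(words.issuperset(s) for s in word_sets):
                sample.append(doc)
    return sample


# Toy data: two hypothetical "search engine" buckets, each with one off-topic result.
buckets = [
    [
        "data mining finds association rules",
        "association rules from data",
        "cheap flights to paris",
    ],
    [
        "mining association rules quickly",
        "association rules are useful",
        "best pizza recipe ever",
    ],
]
shared = common_word_sets(buckets)
sample = select_sample(buckets, shared)
```

On this toy input the only word pair frequent in both buckets is ("association", "rules"), so the sample keeps the four on-topic documents and drops the two off-topic ones. According to the abstract, a sample obtained this way would then be used to train a model that classifies the full document set; that training step is not shown here. Enumerating all word combinations per document is only practical for short texts, and a real association rule miner would prune candidates instead.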