User-Interest-Based document filtering via semi-supervised clustering

  • Authors:
  • Na Tang; V. Rao Vemuri

  • Affiliations:
  • Computer Science Dept., University of California, Davis, Davis, CA; Computer Science Dept., University of California, Davis, Davis, CA

  • Venue:
  • ISMIS'05 Proceedings of the 15th international conference on Foundations of Intelligent Systems
  • Year:
  • 2005

Abstract

This paper studies the task of user-interest-based document filtering, in which users aim to find documents on a specific topic within a large document collection. This is usually done by a text categorization process that divides the documents into two categories: one containing all the desired documents (called positive documents) and the other containing all remaining documents (called negative documents). However, in many cases, some of the negative documents are close enough to the positive documents to prompt reconsideration (these are called deviating negative documents). Simply treating them as negative documents would degrade categorization accuracy. We modify and extend a semi-supervised clustering method to perform the categorization. Compared to the original method, our approach incorporates more informative initialization and constraints and, as a result, leads to better clustering results. Experiments show that our approach achieves better (sometimes significantly better) categorization accuracy than the original method in the presence of deviating negative documents.