User-Interest-Based document filtering via semi-supervised clustering

Authors:
Na Tang;V. Rao Vemuri
Affiliations:
Computer Science Dept., University of California, Davis, Davis, CA;Computer Science Dept., University of California, Davis, Davis, CA
Venue:
ISMIS'05 Proceedings of the 15th international conference on Foundations of Intelligent Systems
Year:
2005

Abstract

This paper studies user-interest-based document filtering, in which a user seeks documents on a specific topic within a large document collection. This is usually done by text categorization, which divides the documents into two categories: one containing all the desired documents (called positive documents) and the other containing all the remaining documents (called negative documents). In many cases, however, some of the negative documents are close enough to the positive documents to warrant reconsideration; we call these deviating negative documents. Simply treating them as negative documents degrades categorization accuracy. We modify and extend a semi-supervised clustering method to perform the categorization. Compared to the original method, our approach incorporates more informative initialization and constraints and, as a result, produces better clustering. Experiments show that our approach achieves better (sometimes significantly better) categorization accuracy than the original method in the presence of deviating negative documents.
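The abstract does not spell out the clustering procedure, so the following is only a rough illustration of the general idea of semi-supervised clustering with labeled seeds, not the authors' method. It is a minimal sketch of a seeded, constrained k-means: the function name `seeded_kmeans`, the toy 2-D "document vectors", and the choice of Euclidean distance are all assumptions made for the example. Labeled documents initialize the centroids and are pinned to their cluster throughout.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def seeded_kmeans(points, seeds, k=2, iters=20):
    """Hypothetical seeded k-means sketch (an assumption, not the paper's algorithm).

    points: list of feature vectors, one per document.
    seeds:  {point index: cluster id}; every cluster needs at least one seed.
    Seeded points never change cluster (a simple hard constraint).
    """
    # Informative initialization: each centroid is the mean of its seeds.
    centroids = [
        mean([points[i] for i, c in seeds.items() if c == j]) for j in range(k)
    ]
    labels = [None] * len(points)
    for _ in range(iters):
        new_labels = []
        for i, p in enumerate(points):
            if i in seeds:
                new_labels.append(seeds[i])  # keep the user-provided label
            else:
                nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
                new_labels.append(nearest)
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [points[i] for i, c in enumerate(labels) if c == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids


# Toy data: cluster 0 stands for "positive", cluster 1 for "negative".
points = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9), (0.8, 0.3), (0.15, 0.85)]
seeds = {0: 0, 2: 1}  # one labeled positive and one labeled negative document
labels, centroids = seeded_kmeans(points, seeds)
print(labels)
```

In a filtering setting, one could imagine giving documents that sit near the positive cluster their own seeds or constraints instead of lumping them in with the negatives; how the paper actually handles deviating negative documents is not stated in the abstract.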