Effective semi-supervised document clustering via active learning with instance-level constraints

  • Authors:
  • Weizhong Zhao;Qing He;Huifang Ma;Zhongzhi Shi

  • Affiliations:
  • Chinese Academy of Sciences, The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, 100190, Beijing, China and Graduate University of Chinese Academy of Scien ... (same affiliation listed for all four authors)

  • Venue:
  • Knowledge and Information Systems
  • Year:
  • 2012

Abstract

Semi-supervised document clustering, which uses a limited amount of supervised data to help group unlabeled documents into clusters, has received significant interest recently. Because obtaining supervised data may be expensive, it is important to acquire the most informative knowledge to improve clustering performance. This paper presents a semi-supervised document clustering algorithm and a new method for actively selecting informative instance-level constraints to improve clustering performance. The semi-supervised document clustering algorithm is a Constrained DBSCAN (Cons-DBSCAN) algorithm, which incorporates instance-level constraints to guide the clustering process in DBSCAN. An active learning approach is proposed to select informative document pairs for which user feedback is obtained. Experimental results show that Cons-DBSCAN with the proposed active learning approach can improve clustering performance significantly when given a relatively small number of constraints.
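
To make the idea of guiding DBSCAN with instance-level constraints concrete, here is a minimal sketch. It is not the paper's Cons-DBSCAN, whose details the abstract does not give. It assumes the constraints take the common must-link / cannot-link form, and the function name `constrained_dbscan`, its parameters, and the merge rule are all hypothetical choices made for illustration. A point is kept out of a growing cluster if it has a cannot-link partner already inside it, and clusters joined by a must-link pair are merged afterwards when that does not break a cannot-link.

```python
from math import dist  # Euclidean distance, Python 3.8+


def constrained_dbscan(points, eps, min_pts, must_link=(), cannot_link=()):
    """Illustrative constraint-aware DBSCAN (an assumption, not Cons-DBSCAN).

    points      : list of coordinate tuples
    must_link   : pairs of indices that should share a cluster
    cannot_link : pairs of indices that must not share a cluster
    Returns a list of labels; -1 marks noise.
    """
    n = len(points)
    # Cannot-link lookup: index -> indices it must not be clustered with.
    forbidden = {i: set() for i in range(n)}
    for a, b in cannot_link:
        forbidden[a].add(b)
        forbidden[b].add(a)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    labels = [None] * n  # None = unvisited, -1 = noise
    members = {}         # cluster id -> set of point indices
    next_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # not a core point; may become a border point later
            continue
        cid, next_id = next_id, next_id + 1
        labels[i] = cid
        members[cid] = {i}
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] not in (None, -1):
                continue  # already assigned to some cluster
            if forbidden[j] & members[cid]:
                continue  # cannot-link: leave j for a different cluster
            labels[j] = cid
            members[cid].add(j)
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)

    # Must-link: merge two clusters unless that would violate a cannot-link.
    for a, b in must_link:
        ca, cb = labels[a], labels[b]
        if ca == cb or ca < 0 or cb < 0:
            continue
        if any(forbidden[p] & members[cb] for p in members[ca]):
            continue
        for p in members[cb]:
            labels[p] = ca
        members[ca] |= members.pop(cb)
    return labels
```

In an active learning setting such as the one the abstract describes, the pairs passed as `must_link` and `cannot_link` would come from user feedback on selected document pairs. How those pairs are chosen is the paper's contribution and is not reproduced here.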