An active learning framework for semi-supervised document clustering with language modeling

  • Authors:
  • Ruizhang Huang;Wai Lam

  • Affiliations:
  • Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong;Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong

  • Venue:
  • Data & Knowledge Engineering
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper investigates a framework that actively selects informative document pairs for obtaining user feedback for semi-supervised document clustering. A gain-directed document pair selection method that measures how much we can learn by revealing judgments of selected document pairs is designed. We use the estimation of term co-occurrence probabilities as a clue for finding informative document pairs. Term co-occurrence probabilities are considered in the semi-supervised document clustering process to capture term-to-term dependence relationships. In the semi-supervised document clustering, each cluster is represented by a language model. We have conducted extensive experiments on several real-world corpora. The results demonstrate that our proposed framework is effective.